Preface


Overview 


Welcome to the VIS Instruction Set User’s Guide. This book presents information 
about the VIS Instruction Set, which is an extension to the SPARC-V9 instruction 
set. The manual describes: 


* Introduction to the UltraSPARC I/II architecture 
* UltraSPARC III Instruction Set Extensions 

* VIS development environment 

* VIS instructions 


* Select examples, illustrating the use of VIS to process multimedia data 


How to Use This Book 


This book is provided with the UltraSPARC developers kit and provides you 
with a complete definition of the VIS instructions with some illustrative code ex- 
amples. Since the examples given include some assembly code, you should refer 
to The SPARC Architecture Manual, Version 9, and The UltraSPARC Users Manual 
for a more complete explanation of the concepts presented. 


Textual Conventions 


Fonts are used as follows: 
* Italic font is used to refer to variables in text.


* Typewriter font is used for code examples and function names. 





* Bold font is used for emphasis. 


Content of Chapters 


The VIS User’s Guide is designed to introduce you to the VIS Instruction Set so
that you can write image processing, graphics, or other applications for the
UltraSPARC processor.


* Chapter 1, “Introduction,” provides a high level overview of the UltraSPARC 
superscalar processor and the performance advantages of the VIS Instruction 
Set. 


* Chapter 2, “UltraSPARC Concepts,” describes the hardware features of the
UltraSPARC that account for the substantial performance enhancement, as well as
the UltraSPARC III instruction set extensions.


* Chapter 3, “Developing VIS Applications,” describes the application
development process, including how to build 32-bit and 64-bit VIS 1.0
and VIS 2.0 applications.


* Chapter 4, "VIS Instructions," introduces you to VIS, and includes simple 
examples of instruction use. 


* Chapter 5, "Code Examples," provides example programs taken from the 
applications areas of imaging, graphics, audio and video. 


* Chapter 6, "Improving Performance," presents helpful hints and suggestions 
to consider when writing code for the UltraSPARC. 
Introduction 1


1.1 Chapter Overview 


This chapter provides a brief introduction to the UltraSPARC I/II superscalar 
processor with special emphasis on the VIS Instruction Set. Topics included in 
this chapter are: 


* Description of UltraSPARC I/II. 


* Introduction to the VIS Instruction Set. 


1.2 UltraSPARC I/II 


UltraSPARC I/II is a highly integrated superscalar processor implementing the
64-bit SPARC-V9 RISC architecture. Its major performance feature is the ability
to sustain an execution rate of four instructions per cycle at a high clock rate,
even in the presence of conditional branches and cache misses.


UltraSPARC I/II supports 64-bit virtual addresses and integer data sizes up to 64 
bits while preserving compatibility with code written for the 32-bit SPARC V8 
processors. Of major significance is the incorporation of 16 additional double-pre- 
cision floating-point registers, bringing the total up to 32. 


The Floating-point unit (FPU) data paths have been enhanced to include the
capability to perform partitioned integer arithmetic operations required for
graphics applications. This capability is provided by a graphics adder that is
organized as four independent 16-bit adders, a graphics multiplier that is
composed of four 8x16 multipliers and a pixel distance logic implementation. A
graphics status register (GSR) with scale factor and align offset fields is
included to support format conversions and memory alignment.





The arithmetic is performed on two new partitioned data types: pixel and fixed
data. Pixels consist of four 8-bit unsigned integers contained in a 32-bit word. The
vis_pdist() instruction accepts eight 8-bit unsigned integers in a 64-bit register.
Fixed data consists of either four 16-bit or two 32-bit fixed-point components
contained in a 64-bit word, or of either two 16-bit components or one 32-bit
component contained in a 32-bit register.


To take advantage of the modified floating point pipeline to perform partitioned 
integer arithmetic, a VIS Instruction Set extension is included to support graphics 
and other applications with the following functions: 


1. Format conversions such as converting pixel data to fixed data format 
operating on either 16-bit or 32-bit components. 


2. Arithmetic operations such as partitioned add and subtract on either 16-bit 
or 32-bit components and seven variants of partitioned multiply 
instructions capable of 8-bit and 16-bit component multiplication. 


3. Logical operations that perform any one of 16 bitwise logical operations. 
4. Address handling instructions to deal with misaligned data. 


5. Array instructions to provide efficient access to three-dimensional (3D) 
data sets. 


6. Memory access instructions permitting partial stores of partitioned data 
and performing 8-bit and 16-bit loads and stores to and from 64-bit or 32- 
bit variables. 


7. Pixel distance instruction computing the absolute difference between
corresponding 8-bit components in a pair of double precision registers and
accumulating the sum of differences.
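

The pixel distance operation in item 7 can be modeled in portable C. The sketch below is only a scalar reference model for checking results on any machine; pdist_ref and pdist_ref_example are hypothetical names, not VIS intrinsics, and the lane order is irrelevant here because all eight differences are summed.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference model of one pixel-distance step (hypothetical helper,
 * not a VIS intrinsic): for each of the eight 8-bit lanes of a and b, add
 * the absolute difference of the lane values to the accumulator. */
uint64_t pdist_ref(uint64_t a, uint64_t b, uint64_t acc)
{
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)(a >> (8 * lane)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * lane)) & 0xFFu;
        acc += (x > y) ? (x - y) : (y - x);
    }
    return acc;
}

/* Usage: the lane differences are 7+5+3+1+1+3+5+7 = 32, added to a
 * previously accumulated sum of 10. */
void pdist_ref_example(void)
{
    assert(pdist_ref(0x0102030405060708ULL, 0x0807060504030201ULL, 10) == 42);
}
```

A model like this is typically useful as a test oracle when validating a hand-written VIS loop against known inputs.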


1.3 Performance Advantage of VIS 


Figure 1-1 shows the performance advantage of a partitioned 8-bit x 16-bit
multiplication: four 8x16 multiplies are performed in a single cycle, resulting
in a fourfold speedup.
Figure 1-1 Four multiplications performed in a single cycle
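

To make the four-way 8x16 multiply concrete, here is a scalar C sketch of what the four lanes compute together. It assumes each unsigned 8-bit pixel is multiplied by a signed 16-bit fixed-point coefficient and that the rounded product keeps its upper 16 bits; that rounding rule and the names mul8x16_lane and mul8x16_ref are assumptions made for illustration, and the instruction definitions in Chapter 4 remain authoritative.

```c
#include <assert.h>
#include <stdint.h>

/* One lane: unsigned 8-bit pixel times signed 16-bit coefficient.
 * Assumed rounding: add half an LSB (0x80), then divide by 256 rounding
 * toward minus infinity, written without shifting a negative value. */
int16_t mul8x16_lane(uint8_t pix, int16_t coef)
{
    int32_t t = (int32_t)pix * (int32_t)coef + 0x80;
    int32_t q = t / 256;          /* C division truncates toward zero */
    if (t % 256 < 0) {
        q -= 1;                   /* adjust to floor for negative values */
    }
    return (int16_t)q;
}

/* Four lanes; the hardware performs these in one partitioned operation,
 * this model simply loops over them. */
void mul8x16_ref(const uint8_t pix[4], const int16_t coef[4], int16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = mul8x16_lane(pix[i], coef[i]);
    }
}

/* Usage: a coefficient of 256 acts as 1.0 under the assumed scaling. */
void mul8x16_ref_example(void)
{
    assert(mul8x16_lane(255, 256) == 255);
}
```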
UltraSPARC Concepts 2


2.1 Chapter Overview 


The UltraSPARC microprocessor has major hardware features that implement 64- 
bit SPARC V9 architecture, giving accelerated graphics performance using VIS. 
This chapter describes the following: 


* Functional Units Of the UltraSPARC I/II 
* UltraSPARC I/II front end 

* Integer Execution Unit (IEU) 

* Floating-point/Graphics Unit (FGU) 

* System Interface 

* UltraSPARC I/II Processor Pipeline 


* UltraSPARC III Instruction Set Extensions 


2.2 The Functional Units of UltraSPARC I/II 


Figure 2-1 is a simplified block diagram identifying the following major function- 
al units that make up UltraSPARC I/II. 


1. Front end — The Prefetch/Dispatch Unit (PDU) prefetches instructions 
based on a dynamic branch prediction mechanism and a next field address 
that allows "single cycle branch following." By predicting branches 
accurately (which typically occurs more than 90% of the time), the front 
end can supply four instructions per cycle to the core execution block. 





2. Integer Execution Unit (IEU) — Performs all integer arithmetic/logical
operations. The IEU incorporates a novel 3D register file supporting seven
read and three write ports.


3. Floating-point/Graphics Unit (FGU) — Integrates five functional units and
a Register File made up of 32 64-bit registers. The floating-point adder,
multiplier, and divider, performing all floating-point operations, have been
augmented by a graphics adder and multiplier to perform the partitioned
integer operations required by the VIS Instruction Set.


4. Load Store Unit (LSU) — Executes all instructions that transfer data
between the memory hierarchy and the two register files in the IEU and
the FGU. The Data Cache (D-Cache), Load Buffer, Store Buffer, and Data
Memory Management Unit (DMMU) are included in this unit.


5. External Cache (E-Cache) — Services “misses” from the Instruction Cache
(I-Cache) in the UltraSPARC I/II front end and the D-Cache of the LSU.
Figure 2-1 Simplified Block Diagram of UltraSPARC-I/II


2.3 The UltraSPARC I/II Front End 


The UltraSPARC I/II front end is essentially the Prefetch/Dispatch Unit (PDU). 
Figure 2-2 shows the major components of the UltraSPARC-I/II front end. 
Instructions are prefetched from a pseudo two-way 16kbyte instruction cache.
Each line in the I-Cache contains eight instructions (32 bytes). Every pair of in- 
structions has a 2-bit branch prediction field that maintains the history of a possi- 
ble branch in the pair. The four prediction states are conventional: strongly taken, 
likely taken, strongly not-taken, and likely not-taken. The advantage of the in-cache 
prediction scheme is that it avoids the alias problems encountered in the branch 
history buffer and other similar structures. Every single branch in the I-Cache has 
its dedicated prediction bits (ignoring the rare case of branch couples), which 
translates into a successful prediction rate of 88% for integer code, 94% for float- 
ing-point (SPEC92), and 90% for typical database applications. 
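

The four prediction states named above are commonly implemented as a 2-bit saturating counter. The sketch below shows that conventional scheme in C; the state encoding and the exact transitions are assumptions for illustration, since the text does not give the transitions used by the UltraSPARC hardware, and the names pred_state, predict_taken, and update_pred are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed encoding of the four states as a 2-bit saturating counter. */
enum pred_state {
    STRONG_NOT_TAKEN = 0,
    LIKELY_NOT_TAKEN = 1,
    LIKELY_TAKEN     = 2,
    STRONG_TAKEN     = 3
};

/* Predict taken in the two upper states. */
bool predict_taken(enum pred_state s)
{
    return s >= LIKELY_TAKEN;
}

/* Move one step toward the observed outcome, saturating at both ends. */
enum pred_state update_pred(enum pred_state s, bool taken)
{
    if (taken) {
        return (s == STRONG_TAKEN) ? s : (enum pred_state)(s + 1);
    }
    return (s == STRONG_NOT_TAKEN) ? s : (enum pred_state)(s - 1);
}

/* Usage: a single not-taken outcome does not flip a strongly taken branch. */
void predictor_example(void)
{
    enum pred_state s = update_pred(STRONG_TAKEN, false);
    assert(s == LIKELY_TAKEN);
    assert(predict_taken(s));
}
```

The hysteresis is the point of the two-bit scheme: a loop branch that falls through once at loop exit is still predicted taken on the next entry.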
Figure 2-2 UltraSPARC-I/II Front End

Every group of four instructions in the cache has a “next field” that is simply a
pointer to where the prefetcher should access instructions for the very next cycle. 
In the case of sequential code or for code with a branch predicted not-taken, the 
next field points to the next four instructions in the cache. The next field will con- 
tain the I-Cache index (including the set) of the branch target if a branch is pre- 
dicted taken. The advantage of this scheme is that the next field can always be 
fed back to the I-Cache without qualifying a possible branch. In order to provide 
a one-cycle loop back to the I-Cache, a fast dual-ported structure was used to im- 
plement the next field and the branch prediction bits. Only one set of the cache is 
accessed during a fetch, thus saving power and reducing the cache cycle time. 
Both tags are read so that an incorrect set prediction can be corrected. A two-cycle
penalty occurs for a set misprediction. The next field mechanism allows UltraSPARC
I/II to speculate five branches deep representing up to 18 instructions.


Instructions prefetched by the PDU are expanded to 76 bits in order to facilitate 
decoding done by the grouping logic. These decoded instructions are forwarded 
to a 12-deep instruction buffer which allows the prefetcher to get ahead of the ex- 
ecution units. As long as the instruction queue is kept almost full, cache miss, set 
miss and micro-TLB (uTLB) miss penalties can be hidden from the execution 
units. 


A single entry uTLB provides the prefetcher with a local copy of the last virtual- 
to-physical address translation. In the rare case of a uTLB miss, a one-cycle fetch 
penalty is incurred in order to get the address from the 64-entry, fully-associative 
instruction-TLB (iTLB). 


The grouping logic always looks at the next four candidates in the instruction 
buffer and, based on resource availability and dependencies, issues up to four in- 
structions. Maintaining more than one Program Counter (PC) per group allows 
UltraSPARC I/II to dispatch, in the same group, instructions from two adjacent
basic blocks.


2.3.1 Integer Execution Unit (IEU) 


The Integer Execution Unit (IEU) performs integer computation for all integer 
arithmetic/logical operations. The IEU, as shown in Figure 2-3, includes dual 64- 
bit adders implemented in dynamic circuitry, an inverter, and very little extra 
logic (muxes for immediate bypasses) that form the basic cycle time of the ma- 
chine (together with the data cache access). 
Figure 2-3 Integer Execution Unit

A separate 64-bit adder is provided for virtual address additions for memory
instructions. A simple 64-bit integer multiplier and divider complement the IEU.
The multiplication unit implements a 2-bit Booth encoding algorithm with an
“early-out” mechanism, with a typical latency of eight clock cycles. A 1-bit
non-restoring subtraction algorithm is used in the divide unit, which yields a latency
of 67 clock cycles for a 64-bit x 64-bit division.
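

A 1-bit non-restoring divider produces one quotient bit per step, which is why a 64-bit division takes on the order of 64 steps plus overhead. The sketch below shows the textbook algorithm in C, scaled down to 32-bit unsigned operands so that the signed partial remainder fits in a 64-bit integer. It illustrates the general technique only and is not a description of the UltraSPARC circuit; div_nonrestoring is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook non-restoring division of unsigned 32-bit values.
 * Each iteration shifts in one dividend bit, then subtracts the divisor if
 * the partial remainder is non-negative or adds it back if negative, and
 * records one quotient bit. A single correction step fixes the remainder. */
uint32_t div_nonrestoring(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);                 /* division by zero is not modeled */

    int64_t r = 0;                  /* signed partial remainder */
    uint32_t q = 0;

    for (int i = 31; i >= 0; i--) {
        int64_t bit = (int64_t)((n >> i) & 1u);
        if (r >= 0) {
            r = 2 * r + bit - (int64_t)d;
        } else {
            r = 2 * r + bit + (int64_t)d;
        }
        q = (q << 1) | (uint32_t)(r >= 0);
    }
    if (r < 0) {
        r += (int64_t)d;            /* final remainder correction */
    }
    *rem = (uint32_t)r;
    return q;
}
```

Unlike restoring division, a negative partial remainder is never undone within a step; the next step compensates by adding instead of subtracting, so every step costs exactly one add or subtract.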


2.3.2 Floating-point/Graphics Unit (FGU) 


The Floating-point/Graphics Unit (FGU) shown in Figure 2-4 integrates five 
functional units and a Register File of 32 64-bit registers. The floating-point
adder, multiplier, and divider perform all FP operations while the graphics adder 
and multiplier perform the graphics operations of the VIS Instruction Set. 


Figure 2-4 Floating-point and Graphics Unit (diagram labels: Dispatch Unit, five read addresses, Floating-point/Graphics Register File of 32 64-bit registers, Store Data, Load Data, Completion Unit)
A maximum of two Floating-point/Graphics Operations (FGops) and one FP
load/store operation are executed in every cycle (plus another integer or branch
instruction). All operations, except for divide and square-root, are fully pipelined.
Divide and square-root operations complete out-of-order without inhibiting the
concurrent execution of other FGops. The two graphics units are both fully pipelined
and perform operations on 8-bit or 16-bit pixel components with 16-bit or
32-bit intermediate results.


The Graphics Adder performs single cycle partitioned add and subtract, data 
alignment, merge, expand, and logical operations. Four 16-bit adders are utilized 
and a custom shifter is implemented for byte concatenation and variable byte- 
length shifting. The Graphics Multiplier performs three-cycle partitioned multi- 
plication, compare, pack, and pixel distance operations. Four 8x16 multipliers are 
utilized, and a custom shifter is implemented. Each pixel distance operation
requires eight 8-bit pixel subtractions, absolute values, additions, and a final
alignment.
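

The effect of four independent 16-bit adders can be sketched in portable C as follows. The model assumes each lane wraps modulo 2^16 and that no carry crosses a lane boundary; padd16_ref is a hypothetical name, and the lane numbering is chosen only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a partitioned 16-bit add: the 64-bit operands are treated
 * as four independent 16-bit lanes. Each lane sum is truncated to 16 bits,
 * so a carry out of one lane never reaches its neighbor. */
uint64_t padd16_ref(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (uint32_t)(a >> (16 * lane)) & 0xFFFFu;
        uint32_t y = (uint32_t)(b >> (16 * lane)) & 0xFFFFu;
        uint64_t s = (x + y) & 0xFFFFu;     /* drop the carry out */
        out |= s << (16 * lane);
    }
    return out;
}

/* Usage: the lane holding 0xFFFF + 1 wraps to 0 without disturbing the
 * lane above it, which an ordinary 64-bit add would increment. */
void padd16_ref_example(void)
{
    assert(padd16_ref(0x0001FFFF00020003ULL, 0x0001000100020003ULL)
           == 0x0002000000040006ULL);
}
```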


2.3.3 Load/Store Unit (LSU) 


The Load/Store Unit (LSU) executes all instructions that transfer data between 
the memory hierarchy and the Integer and Floating-point/Graphics Register files. 
The LSU includes the Data Cache, Load Buffer, Store Buffer, and is very closely 
coupled to the second level external cache. See Figure 2-5 for a functional dia- 
gram of the Load/Store Unit. 


2.3.3.1 Data Cache 


The Data Cache (D-Cache) is a 16kB, direct-mapped cache. It has a 32B (256 bits) 
line size with 16B (128 bits) sub-blocks. It is virtually-indexed and physically- 
tagged. The D-Cache is nonblocking and operates using a write-through, no- 
write-allocate policy. Strict inclusion with respect to the E-Cache is maintained, 
facilitating cache coherency. The D-Cache data SRAM is single-ported and can 
support a 64-bit load or a 64-bit store every cycle. In the event of a D-Cache miss, 
an entire sub-block (16B) can be written in one clock. The D-Cache tag SRAM has 
two ports: a read port and a read/write port. These two ports allow a load or store
to perform a tag look-up in parallel with the allocation for an older D-Cache 
miss. 
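

The D-Cache geometry above determines how a virtual address selects a line. With 16kB of data and 32-byte lines, a direct-mapped cache has 512 lines, so addresses 16kB apart map to the same line. The following C sketch works through that arithmetic; the function and constant names are hypothetical, and the tag comparison on the physical address is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry taken from the text: 16kB direct-mapped, 32-byte lines,
 * 16-byte sub-blocks. */
enum {
    DCACHE_BYTES   = 16 * 1024,
    LINE_BYTES     = 32,
    SUBBLOCK_BYTES = 16,
    DCACHE_LINES   = DCACHE_BYTES / LINE_BYTES   /* 512 lines */
};

/* Line selected by a virtual address (the cache is virtually indexed). */
unsigned dcache_index(uint64_t vaddr)
{
    return (unsigned)((vaddr / LINE_BYTES) % DCACHE_LINES);
}

/* Which 16-byte sub-block of the line holds the addressed byte (0 or 1). */
unsigned dcache_subblock(uint64_t vaddr)
{
    return (unsigned)((vaddr % LINE_BYTES) / SUBBLOCK_BYTES);
}

/* Usage: two addresses exactly 16kB apart compete for the same line. */
void dcache_example(void)
{
    assert(dcache_index(0x0020) == dcache_index(0x4020));
}
```

This is why arrays whose base addresses differ by a multiple of 16kB can evict each other from a direct-mapped cache when accessed in lockstep.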


2.3.3.2 Load Buffer 


The load buffer can eliminate stalls caused by D-Cache misses, load-after-store 
hazards, and other conflicts. Nine entries were implemented to cover the addi- 
tional six-cycle latency of a D-Cache miss/E-Cache hit. A rate of one load E- 
Cache hit per cycle can be sustained. Early compiler results indicate that more 
than 50% (statically) of the loops in SPECfp92 are amenable to software pipelining,
based on the E-Cache latency. These loops represent an even larger component
of the dynamic execution time. The load buffer is organized as a circular
queue.
Figure 2-5 Load/Store Unit (diagram labels: Register File, Load Buffer, Store Buffer, Second-Level Cache, Integer/FP Completion Units)


Each load is enqueued with an indication of whether it hits or misses the D- 
Cache. This information is tracked for the lifetime of the operation, even in the 
presence of snoops. An age-based, associative comparison is performed in order 
to adjust the raw D-Cache hit/miss indicator of the incoming load to account for 
allocations or victimizations that may be performed by pending loads to that D- 
Cache line. Thus, the D-Cache tags are only checked once. 
2.3.3.3 Store Buffer


The eight-entry Store Buffer (each entry accounts for a 64-bit datum and its corre- 
sponding address) provides a temporary holding place for store operations until 
they can be “committed” and the D-Cache and/or the E-Cache is available. The 
E-Cache update is a two-step process. First, the E-Cache tags are checked for 
hit/miss; then, the E-Cache write occurs at some later time. The E-Cache tag and 
data RAM accesses are decoupled so that a tag check can occur in parallel with 
the E-Cache data write of an older store, thus maintaining a throughput of one 
store per clock. Additionally, consecutive stores to the same E-Cache line (64B) 
typically require only a single tag check, thus minimizing tag check transactions. 


Store compression combines the last two entries in the store buffer when they 
both write to the same 16B block. Any number of stores can be combined into one 
transaction. Hence, the number of data write transactions is minimized, an
important concern because the D-Cache is a write-through design and all stores
must therefore update the E-Cache.


2.3.3.4 Data Memory Management Unit (DMMU) 


The data memory management unit (DMMU) incorporates a fully associative,
64-entry Translation Lookaside Buffer (TLB) that provides one virtual-to-physical
address translation per cycle. Any combination of the 8kB, 64kB, 512kB and 4MB
supported page sizes is allowed. A TLB miss is handled by software for simplicity
and flexibility, with a simple hardware assist provided for speed. Two read-only
registers contain pointers to translation table entries from the Translation
Storage Buffer (TSB), defined as a simple, direct-mapped software cache. A separate
set of eight global registers is accessible as temporary storage.


2.3.4 External Cache 


The External Cache is used to service misses from the I-Cache in the UltraSPARC
I/II front end and the D-Cache in the LSU. It is a physically addressed and
physically tagged SRAM implementation. The line size is 64 bytes. E-Cache sizes are
model dependent (from 512kB to 4MB for UltraSPARC-I and from 512kB to 16MB
for UltraSPARC-II), and E-Cache data is protected by byte parity. An internal,
delayed write buffer minimizes the write after read (WAR) penalty. Writes to the
SRAM core are delayed until the next write arrives and the buffer
is fully bypassed inside the SRAM.


The additional latency for an internal cache miss and E-Cache hit is six cycles 
(three internal and three external). Reads can be completed in every cycle, with 
data driven the second cycle after address and control signals. UltraSPARC I/II 
does not differentiate between burst reads and two consecutive reads; signals
used for a single read are simply replicated for each subsequent read. The reads 
are fully pipelined and, thus, full throughput is achieved. 


Writes can also be completed every cycle, with data driven the cycle after address 
and control. A dead cycle is created when switching direction on the data bus to 
avoid overlapping drivers. The total write-after-read (WAR) penalty is two cycles. 
There is no read-after-write (RAW) penalty. 


2.3.5 System Interface 


Figure 2-6 shows a complete UltraSPARC I/II subsystem, consisting of the
UltraSPARC I/II processor, synchronous SRAM components for the External Cache
tags and data, and two UltraSPARC I/II Data Buffer (UDB) chips.


Figure 2-6 UltraSPARC I/II System Interface


The UDBs serve to electrically isolate the interaction between the CPU and E- 
Cache from the system bus and operate at the system clock frequency, which can 
be either one-half or one-third of the processor clock. Collectively, the UDBs have 
FIFOs for eight 16-byte noncacheable stores, one 64-byte read buffer, two 64-byte
write buffers, and a 64-byte copyback buffer. The large number of outstanding 16- 
byte stores is useful for maintaining peak store bandwidth to a frame buffer. 


System transactions are packet based, in the sense that address and data transfers 
are disjoint non-interfering events. A 36-bit address bus is used to deliver two-cy- 
cle request packets that begin a transaction. This bus can be shared by up to three 
other masters, in addition to a centralized system controller. 


Arbitration is distributed. Each master on the address bus has the same logic and 
sees all requests for the bus. There are five potential requests: four potential mas- 
ters plus one from a high-priority system controller. Arbitration is round-robin 
with a hysteresis effect to reduce latency for the last master. This helps reduce la- 
tency for bursts of transactions from the same master. A special parking mode ex- 
ists for uniprocessors that typically reduces arbitration latency to zero by keeping 
UltraSPARC I/II enabled onto the address bus between transactions. 


2.4 Processor Pipeline 


The functions performed by the IEU, LSU and FGU are implemented in a dual 
9-stage pipeline. Most instructions go through the pipeline in exactly nine stages. 
The instructions are considered terminated after they go through the last stage 
(W), after which, changes to the processor state are irreversible. Figure 2-7 shows 
a diagram of the integer and floating-point pipeline stages. Three additional stag- 
es are added to the integer pipeline to make it symmetrical with the floating- 
point pipeline. This simplifies pipeline synchronization and exception handling 
and eliminates the need to implement a floating-point queue. 


Floating-point instructions with a latency greater than three (divide, square root, 
and inverse square root) behave differently than other instructions, in the sense 
that the pipe is "extended" when the instruction reaches stage N1. Memory
operations are allowed to proceed asynchronously with the pipeline in order to
support latencies longer than the latency of the on-chip data cache.
Figure 2-7 UltraSPARC I/II Nine-stage Dual Pipeline.


2.5 Pipeline Stage Description 


2.5.1 Stage 1: Fetch (F) Stage 


In this stage instructions are fetched from the instruction Cache (I-Cache) and 
placed in the Instruction Buffer, from where they will be selected for execution. 
Up to four instructions are fetched, along with branch prediction information, the 
predicted target address of a branch, and the predicted set of the target. The high 
bandwidth provided by the I-Cache (four instructions/cycle) allows the UltraS- 
PARC I/II to prefetch instructions ahead of time, based on the current instruction 
flow and branch prediction. Providing a fetch bandwidth greater than, or equal 
to, the maximum execution bandwidth assures that (for well behaved code) the 
processor does not starve for instructions. Exceptions to this rule occur when 
branches are hard to predict, when branches are very close to each other, or when 
the I-Cache miss rate is high. 
2.5.2 Stage 2: Decode (D) Stage


In this stage the fetched instructions are pre-decoded and sent to the Instruction 
Buffer. The pre-decoded bits generated during this stage accompany the instruc- 
tions during their stay in the Instruction Buffer. Upon reaching the next stage 
(where the grouping logic lives), these bits speed up the parallel decoding of up 
to four instructions. 


While it is being filled, the Instruction Buffer also presents up to four instructions 
to the next stage. A pair of pointers manage the Instruction Buffer, ensuring that 
as many instructions as possible are presented in order to the next stage. 


2.5.3 Stage 3: Grouping (G) Stage


In this stage, the main task is to group and dispatch a maximum of four valid in- 
structions in one cycle. It receives a maximum of four valid instructions from the 
Prefetch and Dispatch Unit (PDU), controls the Integer Core Register File (ICRF), 
and routes valid data to each integer functional unit. The G Stage sends up to two 
floating-point or graphics instructions out of the four candidates to the Floating- 
point/Graphics Unit (FGU). Additionally, the logic in the G Stage is responsible 
for comparing register addresses for integer data bypassing and for handling 
pipeline stalls due to interlocks. 


2.5.4 Stage 4: Execution (E) Stage 


In this stage, data from the integer register file is processed by the two integer 
ALUs during this cycle (if the instruction group includes ALU operations). Re- 
sults are computed and are available for other instructions (through bypasses) in 
the very next cycle. The virtual address of a memory operation is calculated in 
this stage in parallel with ALU computation. 


In the Floating-point/Graphics pipe, this stage corresponds to the Register (R) 
Stage of the FGU. The floating-point register file is accessed during this cycle. The 
instructions are further decoded and the FGU control unit selects the proper by- 
passes for the current instructions. 


2.5.5 Stage 5: Cache Access (C) Stage


In this stage, the virtual addresses of memory operations calculated in the E Stage
are sent to the tag RAM to determine if the access (load or store type) is a hit or a
miss in the D-Cache. In a parallel operation, the virtual address is sent to the data
MMU to be translated into a physical address. On a load when there are no other
outstanding loads, the data array is accessed so that the data can be forwarded to
dependent instructions in the pipeline as soon as possible.


ALU operations executed in the E Stage generate condition codes in the C Stage. 
The condition codes are sent to the PDU, which checks to determine if a condi- 
tional branch in the group has been correctly predicted. If the branch has been 
mispredicted, earlier instructions in the pipe are flushed and the correct instruc- 
tions are fetched. The results of ALU operations are not modified after the E 
Stage; the data merely propagates down the pipeline (through the annex register 
file), where it is available for bypassing for subsequent operations. 


In the Floating-point/Graphics pipe, this is the X1 Stage. Instructions start their
execution during this stage. Instructions of latency one also finish their execution
phase during the X1 Stage.


2.5.6 Stage 6: N1 Stage


In this stage, a data cache miss/hit or a TLB miss/hit is determined. If a load 
misses the D-Cache, it enters the Load Buffer. The access arbitrates for the E- 
Cache if there are no older, unissued loads. If a TLB miss is detected, a trap is tak- 
en and the address translation obtained by a software routine. The physical ad- 
dress of a store is sent to the Store Buffer during this stage. To avoid pipeline 
stalls when store data is not immediately available, the store address and data 
parts are decoupled and separately sent to the Store Buffer. 


In the Floating-point/Graphics pipe, this is the second execution stage (X2) where
execution continues for most instructions.


2.5.7 Stage 7: N2 Stage


In this stage, the Integer Pipe essentially waits for the Floating-point/Graphics
pipe to complete. Most floating-point instructions in the Floating-point/Graphics
pipe finish execution during this stage. After N2, data can be bypassed for other
stages or forwarded to the data portion of the Store Buffer. All loads that have
entered the Load Buffer in N1 continue their progress through the buffer; they will
reappear in the pipeline only when the data comes back. Normal dependency
checking is performed on all loads, including those in the load buffer.


2.5.8 Stage 8: N3 Stage 


In this stage, the Integer and Floating-point/Graphics pipes converge to resolve 
traps. 
2.5.9 Stage 9: Write (W) Stage


In this stage, all results (integer and floating-point) are written to the register
files. All actions performed during this stage are irreversible. After this stage,
instructions are considered terminated.


2.6 Performance Improvement 


The expanded hardware capabilities of the UltraSPARC I/II processor offer a
sustained execution rate of four instructions per cycle even in the presence of
conditional branches and cache misses. Typically this may include the simultaneous
execution of two floating-point/graphics, one integer and one load/store
instruction per cycle.


2.7 UltraSPARC III Instruction Set Extensions 


UltraSPARC III has added Sun proprietary extensions to the SPARC-V9 Instruc- 
tion Set Architecture (ISA), in addition to those implemented in UltraSPARC I/II. 
The extensions are in the areas of VIS extensions, prefetch enhancement, and in- 
terval arithmetic support. 


2.7.1 VIS Extensions 


Three new VIS instructions were added: 


Byte Mask — Sets the Graphics Status Register (GSR) for a following byte 
shuffle operation. One byte mask can be issued per instruction group as the 
last instruction of the group. 


Byte Mask is a break-after instruction. 


Byte Shuffle — Allows any set of 8 bytes to be extracted from a pair of 
double-precision, floating-point registers and written to a destination double- 
precision, floating-point register. The 32-bit byte mask field of the GSR 
specifies the pattern of source bytes for the byte shuffle instruction. 


Edge(ncc) — Two variants: the original instruction sets the integer condition 
codes, and the new instruction does not set condition codes. Differences 
between the variants are as follows: 


Edge — Sets integer condition codes, single instruction group. 


Edgencc — Does not set integer condition codes, groupable.


Because of implementation restrictions in the pipeline, all instructions that set
condition codes and execute in the MS pipeline stage must be in a single
instruction group.
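

The Byte Shuffle operation described in this section can be modeled in portable C. The sketch assumes that the two source registers form a 16-byte array with the first source's most significant byte as byte 0, and that each 4-bit field of the 32-bit mask, read from the most significant field down, selects the source byte for the next destination byte. That numbering is an assumption made for illustration, and bshuffle_ref is a hypothetical name rather than a VIS intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a byte shuffle: pick any 8 of the 16 source bytes.
 * Assumed numbering: bytes 0..7 come from rs1 (most significant first),
 * bytes 8..15 from rs2; mask field i (most significant first) selects the
 * source byte for destination byte i. */
uint64_t bshuffle_ref(uint64_t rs1, uint64_t rs2, uint32_t mask)
{
    uint8_t src[16];
    for (int i = 0; i < 8; i++) {
        src[i]     = (uint8_t)(rs1 >> (56 - 8 * i));
        src[8 + i] = (uint8_t)(rs2 >> (56 - 8 * i));
    }

    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        unsigned sel = (mask >> (28 - 4 * i)) & 0xFu;
        out = (out << 8) | src[sel];
    }
    return out;
}

/* Usage: under the assumed numbering, mask 0x08192A3B interleaves the
 * upper four bytes of the two sources. */
void bshuffle_ref_example(void)
{
    assert(bshuffle_ref(0x0011223344556677ULL, 0x8899AABBCCDDEEFFULL,
                        0x08192A3Bu) == 0x0088119922AA33BBULL);
}
```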


2.7.2 Prefetch Enhancement 


The processor supports an instruction to invalidate a prefetched line. It invali- 
dates a prefetch cache line after prefetched noncacheable data have been loaded 
into registers and on error conditions. 


2.7.3 Interval Arithmetic Support


One new instruction was added to improve the efficiency of interval arithmetic 
computations. The Set Interval Arithmetic Mode (SIAM) instruction enables the 
rounding mode bits in the Floating-Point Status Register (FSR) to be overridden 
without the overhead of modifying the RD field of the FSR. Updates directly to
FSR are expensive because they flush the pipeline. 


Developing VIS Applications 3


3.1 Chapter Overview 


This chapter describes the application development process and includes the
following topics:


* How to build a 32-bit VIS 1.0 application 
* How to build a 32-bit VIS 2.0 application 
* How to build a 64-bit VIS 1.0 application 
* How to build a 64-bit VIS 2.0 application 


Note: A 32-bit VIS 1.0 application can be run on either a 32-bit or 64-bit Solaris 
environment with an UltraSPARC I/II/III processor. A 32-bit VIS 2.0 application 
can be run on either a 32-bit or 64-bit Solaris environment with at least an 
UltraSPARC-III processor. A 64-bit VIS 1.0 application can be run only on a 64-bit 
Solaris environment with an UltraSPARC I/II/III processor. A 64-bit VIS 2.0 
application can be run only on a 64-bit Solaris environment with at least an 
UltraSPARC-III processor. 
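The rules in the note above can be written as a small predicate. The following is a minimal sketch in portable C; the type and function names (vis_app_kind, vis_host_kind, vis_app_can_run) are hypothetical and are not part of the VSDK.

```c
#include <stdbool.h>

/* Hypothetical descriptors used only for this illustration. */
typedef struct {
    int app_bits;   /* 32 or 64 */
    int vis_major;  /* 1 for VIS 1.0, 2 for VIS 2.0 */
} vis_app_kind;

typedef struct {
    int os_bits;    /* 32 or 64: mode of the Solaris environment */
    int cpu_gen;    /* 1, 2, or 3 for UltraSPARC I, II, III */
} vis_host_kind;

/* Returns true when the note's rules say the application can run. */
bool vis_app_can_run(vis_app_kind app, vis_host_kind host)
{
    /* A 64-bit application runs only in a 64-bit Solaris environment. */
    if (app.app_bits == 64 && host.os_bits != 64)
        return false;
    /* VIS 2.0 needs at least an UltraSPARC III. */
    if (app.vis_major >= 2)
        return host.cpu_gen >= 3;
    /* VIS 1.0 runs on UltraSPARC I, II, or III. */
    return host.cpu_gen >= 1;
}
```

For example, a 32-bit VIS 2.0 application on a 64-bit environment with an UltraSPARC II is rejected, because only the processor requirement fails.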





The three steps to building a VIS application are coding, compiling, and linking.
They are described for each kind of application in the sections below.
Table 3-1 Summary of VIS Application Development Requirements

                  32-bit VIS Application           64-bit VIS Application
OS Mode           Compile   Link   Run             Compile   Link   Run
UltraSPARC I&II
UltraSPARC III

Environment
Compiler          SPARCompiler 4.0 or later for    Sun WorkShop 5.0 or later for
                  applications using VIS 1.0       applications using VIS 1.0
                  Sun WorkShop 5.0 or later for    Sun WorkShop 6 update 1 or later
                  applications using VIS 2.0       for applications using VIS 2.0
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3.2 Building a 32-bit VIS 1.0 application 


To build a 32-bit VIS 1.0 application, it is necessary to use the SPARCompiler 4.0 
or later on a SPARC system running Solaris 2.5 or later. Note: in order to run a 32- 
bit VIS 1.0 application, at least an UltraSPARC-based system is required. Building 
a 32-bit VIS 1.0 application requires the following three steps: 


1. Coding 


The appropriate header files should be included in the code. For example: 


#include <vis_types.h>
#include <vis_proto.h>

2. Compiling 


During compiling, it is necessary to: 

* use the -xarch=v8plusa flag
* indicate the location of the header files
* provide the path to the 32-bit VIS inline macro file


For example, assuming the VSDK is installed in the default location, /opt, to
compile the file prog.c:

$ cc -c -xarch=v8plusa -I/opt/SUNWvsdk/include
/opt/SUNWvsdk/lib/vis_32.il prog.c


3. Linking 


The -xarch=v8plusa flag is required during linking. For example, to
create the binary prog from object prog.o:

$ cc -o prog -xarch=v8plusa prog.o

Use the file(1) command to check the file types of the objects and binaries.
For example, a 32-bit VIS 1.0 object and binary have the following output:


$ file prog.o prog
prog.o: ELF 32-bit MSB relocatable SPARC32PLUS Version 1, V8+
Required, UltraSPARC1 Extensions Required
prog: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+
Required, UltraSPARC1 Extensions Required, dynamically linked, not
stripped
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3.3 Building a 32-bit VIS 2.0 application 


To build a 32-bit VIS 2.0 application, it is necessary to use Sun WorkShop 5.0
or later on a SPARC system running Solaris 2.5 or later. Note: in order to run a
32-bit VIS 2.0 application, at least an UltraSPARC-III based system is required.
Building a 32-bit VIS 2.0 application requires the following three steps:


1. Coding 


You should include the appropriate header files in the code. For example: 


#include <vis_types.h>
#include <vis_proto.h>

2. Compiling 


During compiling, it is necessary to: 

* use the -xarch=v8plusb and the -DVIS=0x200 flags
* indicate the location of the header files
* provide the path to the 32-bit VIS inline macro file


For example, assuming the VSDK is installed in the default location, /opt, to
compile the file prog.c:

$ cc -c -xarch=v8plusb -DVIS=0x200 -I/opt/SUNWvsdk/include
/opt/SUNWvsdk/lib/vis_32.il prog.c


3. Linking 


The -xarch=v8plusb flag is required during linking. For example, to
create the binary prog from object prog.o:

$ cc -o prog -xarch=v8plusb prog.o

Use the file(1) command to check the file types of the objects and binaries.
For example, a 32-bit VIS 2.0 object and binary have the following output:

% file prog.o prog
prog.o: ELF 32-bit MSB relocatable SPARC32PLUS Version 1, V8+
Required, UltraSPARC3 Extensions Required
prog: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+
Required, UltraSPARC3 Extensions Required, dynamically linked, not
stripped
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3.4 Building a 64-bit VIS 1.0 application 


To build a 64-bit VIS 1.0 application, it is necessary to use WorkShop Compiler
5.0 or later on a SPARC system running Solaris 7 or later. Note: although a
64-bit application can be built in either a 32-bit or a 64-bit Solaris environment,
it can be run only in a 64-bit Solaris environment. Use the isainfo(1) command to
check the mode of the Solaris environment.


For example, the output of a 64-bit environment is: 


% isainfo -v 
64-bit sparcv9 applications 
32-bit sparc applications 


and the output of a 32-bit environment is:

$ isainfo -v
32-bit sparc applications 


Building a 64-bit VIS 1.0 application requires the following three steps: 
1. Coding 


The appropriate header files should be included in the code. For example: 
#include <vis_types.h> 
#include <vis_proto.h> 


2. Compiling 


During compiling, it is necessary to: 

* use the -xarch=v9a flag

* indicate the location of the header files 

* provide the path to the 64-bit VIS inline macro file 


For example, assuming the VSDK is installed in the default location, /opt, to
compile the file prog.c:

$ cc -c -xarch=v9a -I/opt/SUNWvsdk/include
/opt/SUNWvsdk/lib/vis_64.il prog.c


3. Linking 


The -xarch=v9a flag is required during linking. For example, to create the
binary prog from object prog.o:

$ cc -o prog -xarch=v9a prog.o

Use the file(1) command to check the file types of the objects and binaries.
For example, a 64-bit VIS 1.0 object and binary have the following output:

$ file prog.o prog
prog.o: ELF 64-bit MSB relocatable SPARCV9 Version 1, UltraSPARC1
Extensions Required
prog: ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1
Extensions Required, dynamically linked, not stripped











Note: in order to successfully build a 64-bit application, all objects and
libraries used must be 64-bit versions. Refer to "Solaris 7 64-bit Developer's Guide"
(Part No: 805-6250-10) for more information on how to build a 64-bit application.
It is available from the following URL:


http://docs.sun.com:80/ab2/coll.45.10/SOL64TRANS/ 


3.5 Building a 64-bit VIS 2.0 application

To build a 64-bit VIS 2.0 application, it is necessary to use Sun WorkShop 6
update 1 (also known as Forte Developer 6 update 1) or later on a SPARC system
running Solaris 7 or later. Note: in order to run a 64-bit VIS 2.0 application, at
least an UltraSPARC-III based system is required. Additionally, although a 64-bit
application can be built in either a 32-bit or a 64-bit Solaris environment, it can
be run only in a 64-bit Solaris environment. Use the isainfo(1) command to check
the mode of the Solaris environment.


For example, the output of a 64-bit environment is: 


$ isainfo -v 
64-bit sparcv9 applications 
32-bit sparc applications 


and the output of a 32-bit environment is: 


$ isainfo -v 
32-bit sparc applications 


Building a 64-bit VIS 2.0 application requires the following three steps:
1. Coding 


The appropriate header files should be included in the code. For example: 


#include <vis_types.h>
#include <vis_proto.h>

2. Compiling


During compiling, it is necessary to: 

* use the -xarch=v9b and the -DVIS=0x200 flags
* indicate the location of the header files
* provide the path to the 64-bit VIS inline macro file


For example, assuming the VSDK is installed in the default location, /opt, to
compile the file prog.c:

$ cc -c -xarch=v9b -DVIS=0x200 -I/opt/SUNWvsdk/include
/opt/SUNWvsdk/lib/vis_64.il prog.c


3. Linking

The -xarch=v9b flag is required during linking. For example, to create the
binary prog from object prog.o:

$ cc -o prog -xarch=v9b prog.o

Use the file(1) command to check the file types of the objects and binaries.
For example, a 64-bit VIS 2.0 object and binary have the following output:

$ file prog.o prog
prog.o: ELF 64-bit MSB relocatable SPARCV9 Version 1, UltraSPARC3
Extensions Required
prog: ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC3
Extensions Required, dynamically linked, not stripped
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VIS Instructions 4 


4.1 Chapter Overview 


This chapter describes the comprehensive set of VIS instructions that is primarily 
used to write graphics and multimedia applications, but is not restricted to this. 
While the majority of the instructions have a C interface via an inline mechanism, 
some (for example, the block load and block store instructions) do not have a C 

interface and must be written in assembly language. 


Topics included in this chapter are: 

* Definition of the data structures used 
* Description of Utility Inlines 

* Description of Logical Instructions 

* Description of Arithmetic Instructions 
* Description of Packing Instructions 

* Description of Array Instructions 


* Code examples illustrating VIS 
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4.2 Data Types Used 


Figure 4-1 shows the data types used: 


VIS type     C type                                              Bits
vis_s8       signed char                                         7:0
vis_u8       unsigned char                                       7:0
vis_s16      signed short                                        15:0
vis_u16      unsigned short                                      15:0
vis_s32      signed int                                          31:0
vis_u32      unsigned int                                        31:0
vis_f32      float                                               31:0
vis_d64      double                                              63:0
vis_s64      ILP32: signed long long; LP64: signed long          63:0
vis_u64      ILP32: unsigned long long; LP64: unsigned long      63:0
vis_addr     ILP32: unsigned long                                31:0
vis_addr     LP64: unsigned long                                 63:0

Figure 4-1 Graphics Data Formats

All VIS signed values are 2's complement.


All VIS signed values are 2’s complement. 
Note: vis_addr is defined to have the same length as pointers. Therefore, for
the ILP32 data model, it is the same as vis_u32; for the LP64 data model, it is
the same as vis_u64.
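A pointer-sized unsigned integer is what makes address arithmetic portable between the two data models. The following is a minimal sketch in standard C, using uintptr_t as a stand-in for vis_addr; the names my_addr and align_down_8 are hypothetical and are not part of the VSDK.

```c
#include <stdint.h>

/* Stand-in for vis_addr: an unsigned integer as wide as a pointer.
   It is 32 bits under ILP32 and 64 bits under LP64. */
typedef uintptr_t my_addr;

/* Round an address down to an 8-byte boundary with integer arithmetic.
   The same source compiles correctly in both data models because the
   mask is built in the pointer-sized type. */
void *align_down_8(const void *p)
{
    my_addr a = (my_addr)p;
    return (void *)(a & ~(my_addr)7);
}
```

Had the mask been written with a plain 32-bit constant such as ~7u, the upper 32 address bits would be cleared under LP64, which is the kind of error a pointer-sized type avoids.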
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4.2.1 Partitioned Data Formats 


Figure 4-2 shows some of the partitioned data formats used. 


Bit boundaries 31, 23, 15, 7, 0 (vis_u8 components):
An example of four 8-bit unsigned integers contained in a 32-bit
variable. Typically they represent intensity values for an image pixel,
for example, α, B, G, R.

Bit boundaries 31, 16, 15, 0:
An example of two 16-bit signed fixed-point values contained in a
32-bit variable. For example, they may represent filter coefficients or
scaling factors.

Bit boundaries 63, 47, 31, 15, 0:
An example of four 16-bit signed fixed-point values contained in a
vis_d64 variable. For example, they may represent the result of
partitioned multiplication.

Bit boundaries 63, 55, 47, 39, 31, 23, 15, 7, 0 (vis_u8 components):
An example of eight 8-bit values contained in a vis_d64 variable.
Typically, they would represent two pixels.


Figure 4-2 Partitioned Data Formats 


4.2.2 Fixed Data Formats 


Fixed data values provide an intermediate format with enough precision and
dynamic range for filtering and simple image computations on pixel values.
Conversion from pixel data to fixed data occurs through pixel multiplication or
application of the vis_fexpand() instruction. Conversion from fixed data to pixel
data is performed with the pack instructions, which clip and truncate to an 8-bit
unsigned value. Conversion from 32-bit fixed to 16-bit fixed is also supported
with the vis_fpackfix() instruction. Rounding can be performed by adding one to
the round bit position. Complex calculations requiring more dynamic range or
precision should be performed by using floating-point data.
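The round, truncate, and clip sequence described above can be illustrated in portable C. This is a generic sketch of the arithmetic only, not the exact behavior of any pack instruction; the function name fixed16_to_pixel and the frac_bits parameter are assumptions made for the illustration.

```c
#include <stdint.h>

/* Convert a signed 16-bit fixed-point value with frac_bits fractional
   bits (1 <= frac_bits <= 15) to an 8-bit unsigned pixel. */
uint8_t fixed16_to_pixel(int16_t value, int frac_bits)
{
    /* Rounding: add one at the round bit, the bit just below the
       least-significant bit that survives truncation. */
    int32_t rounded = (int32_t)value + ((int32_t)1 << (frac_bits - 1));

    /* Clip below: anything negative becomes 0. Testing before the shift
       also avoids right-shifting a negative value. */
    if (rounded < 0)
        return 0;

    /* Truncate the fractional bits. */
    int32_t integer = rounded >> frac_bits;

    /* Clip above: anything larger than 255 becomes 255. */
    if (integer > 255)
        return 255;
    return (uint8_t)integer;
}
```

With four fractional bits, the value 1608 (100.5) rounds to 101, while 4800 (300.0) clips to 255.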


4.2.3 Include Directives 


The following include directives apply to all code examples: 


#include "vis_types.h"
#include "vis_proto.h"


4.3 Utility Inlines


Utility inlines are not part of the VIS extension and are included to complement 
the use of the VIS. These instructions offer the ability to read and write upper and 
lower components of floating-point registers and to modify the contents of the 
Graphics Status Register. 


4.3.1 vis_write_gsr[32, 64](), vis_read_gsr[32, 64]()

Function
Assign a value to the Graphics Status Register (GSR) and read the Graphics
Status Register.
Syntax
vis_u32 vis_read_gsr32();
void vis_write_gsr32(vis_u32 gsr);
vis_u64 vis_read_gsr64();
void vis_write_gsr64(vis_u64 gsr);

Description
vis_write_gsr32() writes to the lower 32 bits of the Graphics Status Register.
vis_read_gsr32() reads the lower 32 bits of the Graphics Status Register.
vis_write_gsr64() writes all settable bits of the Graphics Status Register.
vis_read_gsr64() reads all settable bits of the Graphics Status Register.
Reserved (63:7) | SCALE (6:3) | ALIGN (2:0)

Figure 4-3 Graphics Status Register format (UltraSPARC I&II)

MASK (63:32) | Reserved (31:28) | IM (27) | IRND (26:25) | Reserved (24:8) | SCALE (7:3) | ALIGN (2:0)

Figure 4-4 Graphics Status Register format (UltraSPARC III)


Table 4-1 GSR Bit Description

Bit      Field        Description
63:32    MASK<31:0>   This field specifies the mask used by the BSHUFFLE instruction. The field
                      contents are set by the BMASK instruction.
31:28                 Reserved
27       IM           Interval Mode: When IM = 1, the values in FSR.RD and FSR.NS are ignored; the
                      processor operates as if FSR.NS = 0 and rounds floating-point results according
                      to GSR.IRND.
26:25    IRND<1:0>    IEEE Std 754-1985 rounding direction to use in Interval Mode (GSR.IM = 1), as
                      follows:
                          IRND   Round toward ...
                          0      Nearest (even if tie)
                          1      0
                          2      +∞
                          3      -∞
                      When GSR.IM = 1, the value in GSR.IRND overrides the value in FSR.RD.
24:8                  Reserved
7:3      SCALE<4:0>   Shift count in the range 0-31, used by the PACK instructions for formatting.
2:0      ALIGN<2:0>   Three least-significant bits of the address computed by the last executed
                      ALIGNADDRESS or ALIGNADDRESS_LITTLE instruction.
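The bit positions in Table 4-1 translate directly into shifts and masks. The following is a minimal sketch in portable C that decodes a 64-bit GSR image, such as a value returned by vis_read_gsr64(), using the UltraSPARC III layout. The struct gsr_fields and the function gsr_decode are hypothetical names, not part of the VSDK.

```c
#include <stdint.h>

/* Decoded view of a GSR image, UltraSPARC III layout (Table 4-1). */
typedef struct {
    uint32_t mask;   /* bits 63:32, used by BSHUFFLE */
    unsigned im;     /* bit 27, interval mode */
    unsigned irnd;   /* bits 26:25, rounding direction in interval mode */
    unsigned scale;  /* bits 7:3, shift count for the PACK instructions */
    unsigned align;  /* bits 2:0, alignment offset */
} gsr_fields;

gsr_fields gsr_decode(uint64_t gsr)
{
    gsr_fields f;
    f.mask  = (uint32_t)(gsr >> 32);
    f.im    = (unsigned)((gsr >> 27) & 0x1);
    f.irnd  = (unsigned)((gsr >> 25) & 0x3);
    f.scale = (unsigned)((gsr >> 3) & 0x1f);
    f.align = (unsigned)(gsr & 0x7);
    return f;
}
```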


Example 


/* This example illustrates writing to the GSR and changing the
scale factor only */
vis_u32 scalef;
vis_write_gsr32((scalef << 3) | (vis_read_gsr32() & 0x7));
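The expression passed to the write call can be examined on its own in portable C. The sketch below reproduces that expression and, for comparison, a variant that preserves every bit outside the SCALE field. Both function names are hypothetical, and the second variant is an illustration rather than something this guide prescribes.

```c
#include <stdint.h>

/* The value the example writes: new scale factor in bits 7:3, existing
   alignment offset in bits 2:0, all other bits cleared. */
uint32_t gsr32_with_scale(uint32_t old_gsr, uint32_t scalef)
{
    return (scalef << 3) | (old_gsr & 0x7);
}

/* Variant that changes only bits 7:3 and keeps the rest of the lower
   32 bits (for example IM and IRND on UltraSPARC III) unchanged. */
uint32_t gsr32_set_scale_keep_rest(uint32_t old_gsr, uint32_t scalef)
{
    const uint32_t scale_mask = (uint32_t)0x1f << 3;  /* bits 7:3 */
    return (old_gsr & ~scale_mask) | ((scalef << 3) & scale_mask);
}
```

The first form is sufficient for the UltraSPARC I&II layout, where SCALE and ALIGN are the only defined fields; the second matters only when other fields in the lower 32 bits hold state that must survive.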


Note: For multi-threaded VIS applications, the Graphics Status Register (GSR) is 
a resource that can be shared between multiple threads. Ensure that, after setting 
the GSR register, a thread does not voluntarily give up control (for example, via a 
mutex) to another thread that also sets the GSR register. If this occurs, the 
contents of the GSR cannot be relied on after the first thread regains control. 
However, if the same thread is involuntarily made to give up control to the other 
thread (for example, by an interrupt from the operating system), then the 
operating system will perform the necessary context switch, so that each thread 
can rely on the GSR being uncorrupted. 


Note: Aliases to vis_read_gsr() and vis_write_gsr() have been created
as vis_read_gsr32() and vis_write_gsr32(), respectively. When using
vis_write_gsr32() on UltraSPARC-III, the upper 32 bits of the GSR (GSR.mask)
are undefined and should not be relied on.

Note: vis_read_gsr64() and vis_write_gsr64() can be used in both 32-bit
mode and 64-bit mode. In 32-bit mode, vis_u64 is the same as unsigned
long long, so vis_read_gsr64() and vis_write_gsr64() do not
strictly conform to the ANSI/ISO C standard.

Note: The 32/64 in _gsr32 and _gsr64 has a different meaning from that in
vis_32.il and vis_64.il. The former represents how many bits in the GSR are
considered, while the latter represents which mode of the OS is used.

















4.3.2 vis_read_hi(), vis_read_lo(), vis_write_hi(), vis_write_lo()

Function

Read and write to the upper or lower component of a vis_d64 variable.

Syntax

vis_f32 vis_read_hi(vis_d64 variable);
vis_f32 vis_read_lo(vis_d64 variable);
vis_d64 vis_write_hi(vis_d64 variable, vis_f32 uppercomp);
vis_d64 vis_write_lo(vis_d64 variable, vis_f32 lowercomp);

Description

vis_read_hi(), vis_read_lo(), vis_write_hi(), and vis_write_lo() permit read
and write operations on the upper (uppercomp) or lower (lowercomp) 32-bit
components of a vis_d64 variable. However, code written with these
instructions cannot be optimized as easily as code written with
vis_freg_pair().


Example One: 


vis_d64 data_64;
vis_f32 data_32;
/* Extracts the upper 32 bits of data_64 and places them into
data_32 */
data_32 = vis_read_hi(data_64);

In practice, the compiler can often accomplish the same effect by taking
advantage of register pairs. For example, if the value data_64 resides in the
register %d30, vis_read_hi(data_64) becomes a reference to %f30, and
vis_read_lo(data_64) becomes a reference to %f31 in the generated
assembly code.


Example Two: 


vis_d64 data_64;
vis_f32 data_32;
/* Writes data_32 to the lower portion of data_64, leaving the upper
half of data_64 intact */
data_64 = vis_write_lo(data_64, data_32);

If data_64 resides in %d30 and data_32 resides in %f5, then the C statement
could be translated to the following assembly-language statement:
fmovs %f5, %f31


4.3.3 vis_freg_pair()

Function

Join two vis_f32 variables into a single vis_d64 variable.

Syntax
vis_d64 vis_freg_pair(vis_f32 data1_32, vis_f32 data2_32);

Description

vis_freg_pair() joins two vis_f32 values, data1_32 and data2_32, into a single
vis_d64 variable. This is a more efficient way of performing the
equivalent of vis_write_hi() and vis_write_lo(), since the compiler
attempts to minimize the number of floating-point move operations by
strategically using register pairs.


Example 


vis_f32 data1_32, data2_32;
vis_d64 data_64;

/* Produces data_64, with data1_32 as the upper and data2_32 as the
lower component. */
data_64 = vis_freg_pair(data1_32, data2_32);
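The relationship between pairing and the read and write inlines of 4.3.2 can be modeled on plain integers. The following portable C sketch treats a 64-bit value as an upper and a lower 32-bit half. All function names are hypothetical; the real inlines operate on floating-point registers, which this model does not attempt to reproduce.

```c
#include <stdint.h>

/* Join two 32-bit halves, hi in bits 63:32 and lo in bits 31:0. */
uint64_t pair64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Read one half. */
uint32_t read_hi64(uint64_t v) { return (uint32_t)(v >> 32); }
uint32_t read_lo64(uint64_t v) { return (uint32_t)v; }

/* Replace one half, leaving the other intact. */
uint64_t write_hi64(uint64_t v, uint32_t hi)
{
    return (v & 0xffffffffu) | ((uint64_t)hi << 32);
}

uint64_t write_lo64(uint64_t v, uint32_t lo)
{
    return (v & ~(uint64_t)0xffffffffu) | lo;
}
```

In this model, pair64(a, b) gives the same result as write_lo64(write_hi64(0, a), b), which mirrors the statement that vis_freg_pair() performs the equivalent of the two write inlines in one step.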


4.3.4 vis_to_float()

Function

Place a vis_u32 variable into a floating-point register without performing a
floating-point conversion.

Syntax
vis_f32 vis_to_float(vis_u32 data_32);

Description

The semantics of the C compiler require a format conversion when
assigning an integer data_32 to a float variable. Since the VIS does not
operate with floating-point variables, but uses only the floating-point
registers, vis_to_float() bypasses the float conversion and stores the
unmodified bit pattern in a floating-point register.


Example 


vis_u32 data_32;
vis_f32 f;

f = vis_to_float(data_32);

/* The same result would be achieved by the following statement */
/* f = *((vis_f32 *) &data_32); */

/* Taking an illustrative example */
data_32 = 21845;
/* = 5555 (base 16) = 0101010101010101 (base 2) */
f = data_32;
/* will result in f containing a floating-point representation of
"21845.0", which will have a completely different bit pattern than
the one shown. */

f = vis_to_float(data_32);
/* Causes the desired bit pattern to be placed into f */
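The difference between converting a value and reinterpreting its bits can be checked in portable C. The sketch below uses memcpy for the reinterpretation, which is a well-defined alternative to the pointer cast shown in the comment above. It assumes float is a 32-bit IEEE 754 single; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float without converting the value. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Recover the bit pattern stored in a float. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

For data_32 = 21845, float_to_bits(bits_to_float(21845)) is still 0x00005555, whereas float_to_bits((float)21845) is 0x46AAAA00, the IEEE 754 encoding of 21845.0.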


4.3.5 vis_to_double(), vis_to_double_dup(), vis_ll_to_double()

Function

vis_to_double() and vis_to_double_dup() place two vis_u32 values
into a vis_d64 variable.

vis_ll_to_double() places a vis_u64 value into a vis_d64 variable.

Syntax

vis_d64 vis_to_double(vis_u32 data1_32, vis_u32 data2_32);
vis_d64 vis_to_double_dup(vis_u32 data_32);
vis_d64 vis_ll_to_double(vis_u64 data_64);

Description

vis_to_double() places two vis_u32 variables, data1_32 and data2_32, in the
upper and lower halves of a vis_d64 variable. vis_to_double_dup() places
the same vis_u32 variable data_32 in the upper and lower halves of a
vis_d64 variable. vis_ll_to_double() places the vis_u64 variable data_64 in a
vis_d64 variable.


Example 


vis_u32 data1_32, data2_32;
vis_u64 data_64;
vis_d64 result1_64, result2_64, result3_64;

result1_64 = vis_to_double(data1_32, data2_32);
/* data1_32 in upper half and data2_32 in lower half */

result2_64 = vis_to_double_dup(data1_32);
/* data1_32 in upper and lower halves */
/* vis_to_double_dup(data1_32) is equivalent to
vis_to_double(data1_32, data1_32) */

result3_64 = vis_ll_to_double(data_64);

Note: In 32-bit mode, vis_u64 is the same as unsigned long long,
so vis_ll_to_double() does not strictly conform to the ANSI/ISO C standard.


4.4 VIS Logical Instructions 


These instructions include logical operations involving zero, one, or two
arguments.


4.4.1 vis_fzero(), vis_fzeros(), vis_fone(), vis_fones()

Function

Set variable to all ones (base 2) or clear variable to zero.

Syntax

vis_d64 vis_fzero(void);
vis_f32 vis_fzeros(void);
vis_d64 vis_fone(void);
vis_f32 vis_fones(void);
Description
vis_fzero() and vis_fzeros() return vis_d64 and vis_f32 zero-filled variables,
and vis_fone() and vis_fones() return vis_d64 and vis_f32 one-filled
variables.
Example
vis_f32 data_32;
vis_d64 data_64;

data_64 = vis_fzero(); /* data_64 holds 0x0000000000000000 */
data_32 = vis_fones(); /* data_32 holds 0xffffffff */

These instructions set all bits of the destination to zeros or ones. They are
useful for initializing variables, since data_64 may be regarded as a partitioned
variable containing two 32-bit or four 16-bit zero values. (See 4.6, "Arithmetic
Instructions.")


4.4.2 vis_fsrc(), vis_fsrcs(), vis_fnot(), vis_fnots()

Function

Copy a value or its complement.

Syntax
vis_d64 vis_fsrc(vis_d64 data_64);
vis_f32 vis_fsrcs(vis_f32 data_32);
vis_d64 vis_fnot(vis_d64 data_64);
vis_f32 vis_fnots(vis_f32 data_32);

Description

vis_fsrc() copies one vis_d64 variable to another and vis_fnot() copies the
complement of one vis_d64 variable to another. vis_fsrcs() copies one 32-bit
variable to another and vis_fnots() copies the complement of one 32-bit
variable to another.

Example

vis_f32 data1_32, data2_32;
vis_d64 data1_64, data2_64;

data1_32 = vis_fsrcs(data2_32); /* same as data1_32 = data2_32 */
data1_64 = vis_fnot(data2_64); /* same as data1_64 = ~data2_64 */


4.4.3 vis_f[or, and, xor, nor, nand, xnor, ornot, andnot][s]()

Function

Perform logical operations between two 32-bit or two vis_d64 partitioned
variables.

Syntax

vis_d64 vis_for(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fors(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fand(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fands(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fxor(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fxors(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fnor(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fnors(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fnand(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fnands(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fxnor(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fxnors(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fornot(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fornots(vis_f32 data1_32, vis_f32 data2_32);
vis_d64 vis_fandnot(vis_d64 data1_64, vis_d64 data2_64);
vis_f32 vis_fandnots(vis_f32 data1_32, vis_f32 data2_32);

Description

The 64-bit version of these instructions performs one of eight 64-bit logical
operations between data1_64 and data2_64. The 32-bit version of these
instructions performs one of eight 32-bit logical operations between
data1_32 and data2_32.


Example 


vis_f32 data1_32, data2_32, result_32;
vis_d64 data1_64, data2_64, result_64;

/* result_64 holds the result of a logical operation between
data1_64 and data2_64 */
/* result_32 holds the result of a logical operation between
data1_32 and data2_32 */

result_64 = vis_for(data1_64, data2_64);
/* result_64 = data1_64 | data2_64 */
result_32 = vis_fors(data1_32, data2_32);
/* result_32 = data1_32 | data2_32 */
result_64 = vis_fand(data1_64, data2_64);
/* result_64 = data1_64 & data2_64 */
result_32 = vis_fands(data1_32, data2_32);
/* result_32 = data1_32 & data2_32 */
result_64 = vis_fxor(data1_64, data2_64);
/* result_64 = data1_64 ^ data2_64 */
result_32 = vis_fxors(data1_32, data2_32);
/* result_32 = data1_32 ^ data2_32 */
result_64 = vis_fnor(data1_64, data2_64);
/* result_64 = ~(data1_64 | data2_64) */
result_32 = vis_fnors(data1_32, data2_32);
/* result_32 = ~(data1_32 | data2_32) */
result_64 = vis_fnand(data1_64, data2_64);
/* result_64 = ~(data1_64 & data2_64) */
result_32 = vis_fnands(data1_32, data2_32);
/* result_32 = ~(data1_32 & data2_32) */
result_64 = vis_fxnor(data1_64, data2_64);
/* result_64 = ~(data1_64 ^ data2_64) */
result_32 = vis_fxnors(data1_32, data2_32);
/* result_32 = ~(data1_32 ^ data2_32) */
result_64 = vis_fornot(data1_64, data2_64);
/* result_64 = ((~data1_64) | data2_64) */
result_32 = vis_fornots(data1_32, data2_32);
/* result_32 = ((~data1_32) | data2_32) */
result_64 = vis_fandnot(data1_64, data2_64);
/* result_64 = ((~data1_64) & data2_64) */
result_32 = vis_fandnots(data1_32, data2_32);
/* result_32 = ((~data1_32) & data2_32) */
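The eight operations can be modeled on 64-bit integers to check expected results away from SPARC hardware. The following portable C sketch follows the operand order given in the comments of the example above; in particular, the ornot and andnot forms complement the first argument. The emu_ names are hypothetical and are not VSDK functions.

```c
#include <stdint.h>

/* Integer models of the 64-bit VIS logical operations. */
uint64_t emu_for(uint64_t a, uint64_t b)     { return a | b; }
uint64_t emu_fand(uint64_t a, uint64_t b)    { return a & b; }
uint64_t emu_fxor(uint64_t a, uint64_t b)    { return a ^ b; }
uint64_t emu_fnor(uint64_t a, uint64_t b)    { return ~(a | b); }
uint64_t emu_fnand(uint64_t a, uint64_t b)   { return ~(a & b); }
uint64_t emu_fxnor(uint64_t a, uint64_t b)   { return ~(a ^ b); }
/* The first argument is the one that is complemented. */
uint64_t emu_fornot(uint64_t a, uint64_t b)  { return ~a | b; }
uint64_t emu_fandnot(uint64_t a, uint64_t b) { return ~a & b; }
```

Because the operations are purely bitwise, the same model applies whether the 64 bits are viewed as eight 8-bit, four 16-bit, or two 32-bit components.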





4.5 Pixel Compare Instructions:
vis_fcmp[gt, le, eq, ne, lt, ge][16, 32]()

Function

Perform logical comparison between two partitioned variables, and
generate an integer mask describing the result of the comparison.

Syntax

int vis_fcmpgt16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmple16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmpeq16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmpne16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmpgt32(vis_d64 data1_2_32, vis_d64 data2_2_32);
int vis_fcmpeq32(vis_d64 data1_2_32, vis_d64 data2_2_32);
int vis_fcmple32(vis_d64 data1_2_32, vis_d64 data2_2_32);
int vis_fcmpne32(vis_d64 data1_2_32, vis_d64 data2_2_32);
int vis_fcmplt16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmplt32(vis_d64 data1_2_32, vis_d64 data2_2_32);
int vis_fcmpge16(vis_d64 data1_4_16, vis_d64 data2_4_16);
int vis_fcmpge32(vis_d64 data1_2_32, vis_d64 data2_2_32);

Description 


vis_fcmp[gt, le, eq, ne, lt, ge]() compare four 16-bit partitioned or two
32-bit partitioned fixed-point values within data1_4_16, data1_2_32 and
data2_4_16, data2_2_32. The 4-bit or 2-bit comparison results are returned in
the corresponding least-significant bits of a 32-bit value that is typically
used as a mask. A single bit is returned for each partitioned compare and,
in both cases, bit 0 is the least-significant bit of the compare result.

For vis_fcmpgt(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is greater than the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmple(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is less than or equal to
the corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpeq(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is equal to the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpne(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is not equal to the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmplt(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is less than the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpge(), each bit within the 4-bit or 2-bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is greater than or equal to
the corresponding value of [data2_4_16, data2_2_32].

Figure 4-5 shows the four 16-bit pixel comparison operations. Figure 4-6 
shows the two 32-bit pixel comparison operations. 
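The way a partitioned compare builds its mask can be modeled in portable C. The sketch below emulates the 16-bit greater-than and less-than-or-equal compares on 64-bit integers. It assumes that mask bit i corresponds to the 16-bit element occupying bits 16i+15:16i, so that bit 0 belongs to the least-significant element. The helper and function names are hypothetical.

```c
#include <stdint.h>

/* Build a 64-bit word from four signed 16-bit elements, e3 most significant. */
uint64_t pack16x4(int e3, int e2, int e1, int e0)
{
    return ((uint64_t)((unsigned)e3 & 0xffffu) << 48) |
           ((uint64_t)((unsigned)e2 & 0xffffu) << 32) |
           ((uint64_t)((unsigned)e1 & 0xffffu) << 16) |
            (uint64_t)((unsigned)e0 & 0xffffu);
}

/* Extract element i (0 = least significant) as a signed 2's complement value. */
static int elem16(uint64_t v, int i)
{
    unsigned u = (unsigned)((v >> (16 * i)) & 0xffffu);
    return (int)(u ^ 0x8000u) - 0x8000;
}

/* Model of a four-way 16-bit greater-than compare: one mask bit per element. */
int emu_fcmpgt16(uint64_t a, uint64_t b)
{
    int mask = 0;
    for (int i = 0; i < 4; i++)
        if (elem16(a, i) > elem16(b, i))
            mask |= 1 << i;
    return mask;
}

/* Less-than-or-equal is the complement of greater-than within the 4 bits. */
int emu_fcmple16(uint64_t a, uint64_t b)
{
    return ~emu_fcmpgt16(a, b) & 0xf;
}
```

Comparing the elements (1, -2, 300, 7) with (0, -1, 300, 8) sets only the bit for the most-significant element, giving a greater-than mask of 0x8 and a less-than-or-equal mask of 0x7.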


fcmp[gt, le, eq, ne, lt, ge]16: four 16-bit compares of data1_4_16 against
data2_4_16; the results occupy bits 3:0 of the 32-bit mask.

Figure 4-5 16-bit Pixel Comparison Operations

fcmp[gt, le, eq, ne, lt, ge]32: two 32-bit compares of data1_2_32 against
data2_2_32; the results occupy bits 1:0 of the 32-bit mask.

Figure 4-6 32-bit Pixel Comparison Operation


Example 


int mask;
vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;

mask = vis_fcmpgt16(data1_4_16, data2_4_16);
/* data1_4_16 > data2_4_16 */

mask = vis_fcmple16(data1_4_16, data2_4_16);
/* data1_4_16 <= data2_4_16 */

mask = vis_fcmpge16(data1_4_16, data2_4_16);
/* data1_4_16 >= data2_4_16 */

mask = vis_fcmpeq16(data1_4_16, data2_4_16);
/* data1_4_16 == data2_4_16 */

mask = vis_fcmpne16(data1_4_16, data2_4_16);
/* data1_4_16 != data2_4_16 */

mask = vis_fcmplt16(data1_4_16, data2_4_16);
/* data1_4_16 < data2_4_16 */

mask = vis_fcmpgt16(data1_4_16, data2_4_16);
/* data1_4_16 > data2_4_16 */

/* mask may be used as an argument to a partial store instruction
vis_pst_8, vis_pst_16 or vis_pst_32 */
vis_pst_16(data1_4_16, &data2_4_16, mask);
/* Stores the greater 16-bit elements of data1_4_16 or data2_4_16,
overwriting data2_4_16 */
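The compare-then-partial-store idiom above amounts to an element-wise maximum. The following portable C sketch models it on arrays of four 16-bit elements. It assumes that array index i corresponds to mask bit i; the names pst16 and max_into are hypothetical, and the real partial store writes to memory through an address rather than an array.

```c
#include <stdint.h>

/* Model of a 16-bit partial store: copy src[i] to dst[i] only where
   mask bit i is set; other destination elements are left unchanged. */
void pst16(const int16_t src[4], int16_t dst[4], int mask)
{
    for (int i = 0; i < 4; i++)
        if ((mask >> i) & 1)
            dst[i] = src[i];
}

/* Element-wise maximum of a and b, written into b: build a greater-than
   mask, then store only the elements of a that won the comparison. */
void max_into(const int16_t a[4], int16_t b[4])
{
    int mask = 0;
    for (int i = 0; i < 4; i++)
        if (a[i] > b[i])
            mask |= 1 << i;
    pst16(a, b, mask);
}
```

The benefit of the masked store is that no branch is taken per element: the comparison produces a mask once, and the store consumes it.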
4.6 Arithmetic Instructions

The VIS arithmetic instructions perform partitioned addition, subtraction, or
multiplication.


4.6.1 vis_fpadd[16, 16s, 32, 32s](), vis_fpsub[16, 16s, 32, 32s]() 


Function 


Perform addition and subtraction on two 16-bit, four 16-bit, one 32-bit, or
two 32-bit partitioned data.

Syntax

vis_d64 vis_fpadd16(vis_d64 data1_4_16, vis_d64 data2_4_16);
vis_d64 vis_fpsub16(vis_d64 data1_4_16, vis_d64 data2_4_16);
vis_d64 vis_fpadd32(vis_d64 data1_2_32, vis_d64 data2_2_32);
vis_d64 vis_fpsub32(vis_d64 data1_2_32, vis_d64 data2_2_32);
vis_f32 vis_fpadd16s(vis_f32 data1_2_16, vis_f32 data2_2_16);
vis_f32 vis_fpsub16s(vis_f32 data1_2_16, vis_f32 data2_2_16);
vis_f32 vis_fpadd32s(vis_f32 data1_1_32, vis_f32 data2_1_32);
vis_f32 vis_fpsub32s(vis_f32 data1_1_32, vis_f32 data2_1_32);


Description 


vis_fpadd16() and vis_fpsub16() perform partitioned addition and
subtraction between two 64-bit partitioned variables, interpreted as four
16-bit signed components (data1_4_16 and data2_4_16), and return a 64-bit
partitioned variable interpreted as four 16-bit signed components (sum_4_16
or difference_4_16). vis_fpadd32() and vis_fpsub32() perform partitioned
addition and subtraction between two 64-bit partitioned variables,
interpreted as two 32-bit signed components (data1_2_32 and data2_2_32), and
return a 64-bit partitioned variable interpreted as two 32-bit components
(sum_2_32 or difference_2_32). Overflow and underflow are not detected
and result in wraparound.

Figure 4-7 shows the vis_fpadd16() and vis_fpsub16() operations.
Figure 4-8 shows the vis_fpadd32() and vis_fpsub32() operations.

The 32-bit versions interpret their arguments as two 16-bit signed values or
one 32-bit signed value. The single-precision versions of these instructions,

vis_fpadd16s(), vis_fpsub16s(), vis_fpadd32s(), vis_fpsub32s()

perform two 16-bit or one 32-bit partitioned adds or subtracts.

Figure 4-9 shows the vis_fpadd16s() and vis_fpsub16s() operations.
Figure 4-10 shows the vis_fpadd32s() and vis_fpsub32s() operations.
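The wraparound behavior of partitioned arithmetic can be modeled in portable C. The sketch below emulates the four-way 16-bit add and subtract on 64-bit integers: each 16-bit lane is computed independently and reduced modulo 2^16, so no carry or borrow crosses a lane boundary. The emu_ names are hypothetical and are not VSDK functions.

```c
#include <stdint.h>

/* Model of a four-way 16-bit partitioned add. */
uint64_t emu_fpadd16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        uint16_t s = (uint16_t)(x + y);      /* wraps modulo 2^16 */
        r |= (uint64_t)s << (16 * i);
    }
    return r;
}

/* Model of a four-way 16-bit partitioned subtract. */
uint64_t emu_fpsub16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        uint16_t d = (uint16_t)(x - y);      /* wraps modulo 2^16 */
        r |= (uint64_t)d << (16 * i);
    }
    return r;
}
```

This differs from an ordinary 64-bit addition, where a lane holding 0xffff plus 1 would carry into its neighbor; in the partitioned form that lane simply becomes 0x0000 and the neighbor is unaffected.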
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vis_fpadd16() and vis_fpsub16() operation 
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data2_4 16 
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Figure 4-7 
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sum_2 32 or 


difference_2 32 LE. _ בר‎ 
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vis_fpadd32() and vis_fpsub32() operation 
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63 
Figure 4-6 
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data1 2 16 


sum 2 16 or 
difference 2 16 


31 
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Figure 4-9 vis_fpadd16s() and vis_fpsub16s() operation 


data1_1_32 
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data2_1_32 


wo 
= 
o 


sum_1_32 or 
difference_1_32 
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31 
Figure4-10 — vis fpadd32s() and vis_fpsub32s() 


Example 


vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;
vis_d64 sum_4_16, difference_4_16, sum_2_32, difference_2_32;
vis_f32 data1_2_16, data2_2_16, sum_2_16, difference_2_16;
vis_f32 data1_1_32, data2_1_32, sum_1_32, difference_1_32;


sum_4_16 = vis_fpadd16(data1_4_16, data2_4_16);
difference_4_16 = vis_fpsub16(data1_4_16, data2_4_16);
sum_2_32 = vis_fpadd32(data1_2_32, data2_2_32);
difference_2_32 = vis_fpsub32(data1_2_32, data2_2_32);
sum_2_16 = vis_fpadd16s(data1_2_16, data2_2_16);
difference_2_16 = vis_fpsub16s(data1_2_16, data2_2_16);
sum_1_32 = vis_fpadd32s(data1_1_32, data2_1_32);
difference_1_32 = vis_fpsub32s(data1_1_32, data2_1_32);


4.6.2 vis_fmul8x16()


Function 


Multiply the elements of an 8-bit partitioned vis_f32 variable by the
corresponding element of a 16-bit partitioned vis_d64 variable to produce a
16-bit partitioned vis_d64 result.


Syntax 


vis_d64 vis_fmul8x16(vis_f32 pixels, vis_d64 scale);


Description 


vis_fmul8x16() multiplies each unsigned 8-bit component within pixels by
the corresponding signed 16-bit fixed-point component within scale and
returns the upper 16 bits of the 24-bit product (after rounding) as a signed
16-bit component in the 64-bit returned value. In other words:

16-bit result = (8-bit pixel element * 16-bit scale element + 128) / 256


Figure 4-11 shows this operation. 


This instruction treats the pixels values as fixed-point with the binary point 
to the left of the most-significant bit. For example, this operation is used 
with filter coefficients as the fixed-point scale value and image data as the 
pixels value. 
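
The rounding formula above can be modeled lane by lane in portable C. This is a sketch, not the intrinsic: the names model_mul8x16_lane and model_fmul8x16 are hypothetical, uint32_t and uint64_t stand in for vis_f32 and vis_d64, and the division by 256 is assumed to act as a floor (equivalent to keeping the upper 16 bits of the 24-bit product), which matters only for negative scale values.

```c
#include <stdint.h>

/* One lane: (pixel * scale + 128) / 256, with the division taken as a
 * floor so that negative products behave like "keep the upper 16 bits". */
int16_t model_mul8x16_lane(uint8_t pixel, int16_t scale)
{
    int32_t v = (int32_t)pixel * (int32_t)scale + 128;
    int32_t q = v / 256;
    if (v % 256 < 0)
        q -= 1;                     /* C division truncates toward zero */
    return (int16_t)q;
}

/* Four lanes: byte i of pixels is paired with 16-bit element i of scale. */
uint64_t model_fmul8x16(uint32_t pixels, uint64_t scale)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t p = (uint8_t)(pixels >> (8 * lane));
        int16_t s = (int16_t)(uint16_t)(scale >> (16 * lane));
        uint16_t r = (uint16_t)model_mul8x16_lane(p, s);
        result |= (uint64_t)r << (16 * lane);
    }
    return result;
}
```

With a scale element of 0x0100 (1.0 when eight fractional bits are assumed), each pixel value passes through unchanged; with 0x0080 (0.5), a pixel value of 200 becomes 100.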


Figure 4-11 vis_fmul8x16() operation


Example


vis_f32 pixels; 
vis_d64 result, scale; 


result = vis_fmul8x16 (pixels, scale); 


4.6.3 vis_fmul8x16au(), vis_fmul8x16al()


Function 


Multiply the elements of an 8-bit partitioned vis_f32 variable by one
element of a 16-bit partitioned vis_f32 variable to produce a 16-bit
partitioned vis_d64 result.


Syntax 


vis_d64 vis_fmul8x16au(vis_f32 pixels, vis_f32 scale);
vis_d64 vis_fmul8x16al(vis_f32 pixels, vis_f32 scale);


Description 


vis_fmul8x16au() multiplies each unsigned 8-bit value within pixels by a
single 16-bit fixed-point component. The 16-bit fixed-point component is
the most-significant 16 bits of the 32-bit scale. The four pixel values in the
32-bit variable pixels are each multiplied in the same way as in
vis_fmul8x16(), described in Section 4.6.2, "vis_fmul8x16()," except that
the same 16-bit scale value is used for all four multiplications.


Figure 4-12 shows the operation. vis_fmul8x16al() is the same as
vis_fmul8x16au(), except that the least-significant 16 bits of the 32-bit
scale are used as a multiplier. Figure 4-13 shows the vis_fmul8x16al()
operation. Since vis_fmul8x16au() uses the upper 16 bits of scale and
vis_fmul8x16al() uses the lower 16 bits of scale, two distinct scale values
can be stored in scale.


Figure 4-12 vis_fmul8x16au() operation


Figure 4-13 vis_fmul8x16al() operation


Example 


vis_f32 pixels, scale;
vis_d64 resultu, resultl;


/* Most-significant 16 bits of scale multiply */
resultu = vis_fmul8x16au(pixels, scale);
/* Least-significant 16 bits of scale multiply */
resultl = vis_fmul8x16al(pixels, scale);


4.6.4 vis_fmul8sux16(), vis_fmul8ulx16()


Function 


Multiply the corresponding elements of two 16-bit partitioned vis_d64
variables to produce a 16-bit partitioned vis_d64 result.


Syntax 


vis_d64 vis_fmul8sux16(vis_d64 data1_16, vis_d64 data2_16);
vis_d64 vis_fmul8ulx16(vis_d64 data1_16, vis_d64 data2_16);


Description 


Both vis_fmul8sux16() and vis_fmul8ulx16() perform "half" a
multiplication. vis_fmul8sux16() multiplies the signed upper eight bits of
each 16-bit signed component of data1_4_16 by the corresponding 16-bit
fixed-point signed component in data2_4_16. The upper 16 bits of the 24-bit
product are returned in a 16-bit partitioned resultu. The 24-bit product is
rounded to 16 bits. Figure 4-14 shows the operation.


vis_fmul8ulx16() multiplies the unsigned lower eight bits of each 16-bit
element of data1_4_16 by the corresponding 16-bit element in data2_4_16.
Each 24-bit product is sign-extended to 32 bits. The upper 16 bits of the
sign-extended value are returned in a 16-bit partitioned resultl. Figure 4-15
shows the operation.


Because the result of vis_fmul8ulx16() is conceptually shifted right eight
bits relative to the result of vis_fmul8sux16(), the two results have the
proper relative significance to be added together to yield the 16-bit
products of data1_4_16 and data2_4_16.


Each of the "partitioned multiplications" in this composite operation
multiplies two 16-bit fixed-point numbers to yield a 16-bit result. In other
words, the lower 16 bits of the full precision 32-bit result are dropped after
rounding. The location of the binary point in the fixed-point arguments is
under the user's control. It can be anywhere from the right of bit 0 to the
left of bit 14.


For example, each of the input arguments can have eight fractional bits: the
binary point is between bit 7 and bit 8. If a full precision 32-bit result were
provided, it would have 16 fractional bits: the binary point would be
between bits 15 and 16. Since, however, only 16 bits of the result are
provided, the lower 16 fractional bits are dropped after rounding. The
binary point of the 16-bit result in this case is to the right of bit 0.


Another example, shown below, has 12 fractional bits in each of its two
component arguments: the binary point is between bits 11 and 12. A full
precision 32-bit result would have 24 fractional bits: the binary point
between bits 23 and 24. Since, however, only a 16-bit result is provided,
the lower 16 fractional bits are dropped after rounding, thus providing a
result with eight fractional bits: the binary point between bits 7 and 8.


  0101.001010010101   (= 5.161376953125)
x 0001.011001001001   (= 1.392822265625)
= 00000111.00110000   (= 7.1875; the full precision product is 7.188880741596)


Figure 4-14 vis_fmul8sux16() operation


Figure 4-15 vis_fmul8ulx16() operation
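
The 12-fractional-bit example above can be checked with ordinary integer arithmetic. The sketch below is not a model of the two instructions; it only keeps the upper 16 bits of the exact 32-bit product and leaves the rounding step out. The helper names mul16_upper and fixed_to_double are hypothetical.

```c
#include <stdint.h>

/* Upper 16 bits of the exact 32-bit product of two signed 16-bit values.
 * The rounding step is left out of this model. */
int16_t mul16_upper(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    int32_t q = product / 65536;
    if (product % 65536 < 0)
        q -= 1;                     /* floor instead of truncation */
    return (int16_t)q;
}

/* Interpret a fixed-point value that has frac_bits fractional bits. */
double fixed_to_double(int32_t value, int frac_bits)
{
    return (double)value / (double)((int64_t)1 << frac_bits);
}
```

The two operands of the example are 0x5295 and 0x1649 with 12 fractional bits each. mul16_upper(0x5295, 0x1649) gives 0x0730, which reads as 7.1875 when interpreted with eight fractional bits (12 + 12 - 16 = 8).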


Example 


vis_d64 data1_4_16, data2_4_16, resultl, resultu, result;

resultu = vis_fmul8sux16(data1_4_16, data2_4_16);
resultl = vis_fmul8ulx16(data1_4_16, data2_4_16);
result = vis_fpadd16(resultu, resultl); /* 16-bit result of a 16*16 multiply */


4.6.5 vis_fmuld8sux16(), vis_fmuld8ulx16()


Function 


Multiply a 16-bit partitioned vis_f32 variable by a 16-bit partitioned vis_f32
variable to produce a 32-bit partitioned vis_d64 result.


Syntax 


vis_d64 vis_fmuld8sux16(vis_f32 data16s1, vis_f32 data16s2);
vis_d64 vis_fmuld8ulx16(vis_f32 data16s1, vis_f32 data16s2);


Description 


vis_fmuld8sux16() multiplies the upper eight bits of each 16-bit signed
component of data16s1 by the corresponding signed 16-bit element of
data16s2. Figure 4-16 shows the 32-bit result returned by shifting the 24-bit
product left by eight bits.


Figure 4-16 vis_fmuld8sux16() operation


vis_fmuld8ulx16() multiplies the unsigned lower eight bits of each 16-bit
component in data16s1 by the corresponding signed element in data16s2.
Figure 4-17 shows that each 24-bit product is returned as a sign-extended
32-bit result.


Figure 4-17 vis_fmuld8ulx16() operation


vis_fmuld8sux16() and vis_fmuld8ulx16() together perform a true
16x16 -> 32-bit multiplication, taking two vis_f32 arguments, each
containing two 16-bit signed values. As with vis_fmul8sux16() and
vis_fmul8ulx16(), each instruction computes "half" of the product; the two
halves, when added together, give a 32-bit product.
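
The reason the two halves add up to the full product is the identity a = hi * 256 + lo, where hi is the signed upper byte of a and lo is its unsigned lower byte, so that a * b = (hi * b) * 256 + lo * b. The sketch below models one 16-bit lane of each instruction in portable C; the function names are hypothetical and plain integers stand in for the partitioned types.

```c
#include <stdint.h>

/* "Signed upper" half: the signed upper byte of a times b, shifted left by
 * eight bits to restore its weight in the 32-bit result. */
int32_t model_fmuld8sux16_lane(int16_t a, int16_t b)
{
    int32_t lo = (int32_t)((uint16_t)a & 0xFF);
    int32_t hi = ((int32_t)a - lo) / 256;    /* exact: a - lo is a multiple of 256 */
    return hi * (int32_t)b * 256;
}

/* "Unsigned lower" half: the unsigned lower byte of a times b. */
int32_t model_fmuld8ulx16_lane(int16_t a, int16_t b)
{
    int32_t lo = (int32_t)((uint16_t)a & 0xFF);
    return lo * (int32_t)b;
}
```

Adding the two lane results reproduces the exact 32-bit product for any pair of signed 16-bit inputs, which is the role vis_fpadd32() plays in the example that follows.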


Example 
vis_f32 data16s1, data16s2;
vis_d64 result, resultu, resultl;


resultu = vis_fmuld8sux16(data16s1, data16s2);
resultl = vis_fmuld8ulx16(data16s1, data16s2);
result = vis_fpadd32(resultu, resultl);


4.7 Pixel Formatting Instructions 


Pixel formatting instructions include packing instructions which convert 16-bit or
32-bit data to a lower precision fixed or pixel format. Input values are clipped to
the dynamic range of the output format. Packing applies a scale factor
determined from a scale factor field in the Graphics Status Register (GSR) to
allow flexible positioning of the binary point.


Pixel formatting instructions also include expand instructions that convert 8-bit 
elements to 16-bit elements and merge instructions that merge two independent 
pixel data elements into a 64-bit result. 


4.7.1 vis_fpack16() 


Function 


Truncates four 16-bit signed components to four 8-bit unsigned 
components. 


Syntax 
vis_f32 vis_fpack16(vis_d64 data_4_16);


Description 


vis_fpack16() takes four 16-bit fixed components within data_4_16, scales,
truncates, and clips them into four 8-bit unsigned components, and returns
a vis_f32 result. This is accomplished by left shifting the 16-bit component
as determined from the scale factor field of the GSR and truncating to an
8-bit unsigned integer by rounding and then discarding the least-significant
digits. If the resulting value is negative (meaning the MSB is set), zero is
returned. If the value is greater than 255, then 255 is returned. Otherwise,
the scaled value is returned. For an illustration of this operation see
Section 4.7.2, "vis_fpack32()."


Note: The scale factor field of the GSR is 4 bits in the UltraSPARC I/II and 5 bits
in the UltraSPARC III. vis_fpack16() ignores GSR.scale<4> in the UltraSPARC III.


Figure 4-18 vis_fpack16() operation


Example 


vis_d64 data_4_16;
vis_f32 result;


result = vis_fpack16(data_4_16);


4.7.2 vis_fpack32()


Function 
Truncate two 32-bit fixed values into two unsigned 8-bit integers. 


Syntax 
vis_d64 vis_fpack32(vis_d64 data_8_8, vis_d64 data_2_32);


Description 


vis_fpack32() copies its first argument (data_8_8, shifted left by eight bits)
into the destination, the vis_d64 return value. It then extracts two 8-bit
quantities (one each from the two 32-bit fixed values within data_2_32) and
overwrites the least-significant byte positions of the destination. Two pixels
consisting of four 8-bit bytes each may be assembled by repeated operation
of vis_fpack32() on four data_2_32 pairs.


The reduction of data_2_32 from 32 to eight bits is controlled by the scale
factor of the GSR. The initial 32-bit value is shifted left by the GSR.scale
factor, and the result is considered as a fixed-point number with its binary
point between bits 22 and 23. If this number is negative, the output is
clamped to 0; if greater than 255, it is clamped to 255. Otherwise, the eight
bits to the left of the binary point are taken as the output.


Another way to conceptualize this process is to think of the binary point as 
lying to the left of bit (22 - scale factor), in other words, (23 - scale factor) 
bits of fractional precision. The 4-bit scale factor can take any value 
between 0 and 15, inclusive. This means that 32-bit partitioned variables 
which are to be packed using vis_fpack32() can have between eight and 23 
fractional bits. 
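
The scale, clamp, and extract steps described above can be sketched as a scalar model in portable C. The function names are hypothetical. Two points are assumptions rather than statements from the text: bits shifted beyond bit 31 are kept for the clamping decision, and the byte shift of data_8_8 is applied to each 32-bit half separately, as suggested by the r0g0b0a0r1g1b1a1 layout in the example that follows.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce one 32-bit fixed value to an unsigned 8-bit value. */
uint8_t pack32_component(int32_t value, int scale)
{
    assert(scale >= 0 && scale <= 31);       /* 4-bit or 5-bit scale field */
    /* Multiply instead of shifting so that negative values stay well defined. */
    int64_t shifted = (int64_t)value * ((int64_t)1 << scale);
    if (shifted < 0)
        return 0;                            /* negative: clamp to 0 */
    int64_t integer_part = shifted >> 23;    /* bits left of the binary point */
    return integer_part > 255 ? 255 : (uint8_t)integer_part;
}

/* Shift each 32-bit half of data_8_8 left by one byte and place the packed
 * byte into the vacated least-significant position of that half. */
uint64_t model_fpack32(uint64_t data_8_8, uint64_t data_2_32, int scale)
{
    uint64_t result = (data_8_8 << 8) & 0xFFFFFF00FFFFFF00ULL;
    int32_t upper = (int32_t)(uint32_t)(data_2_32 >> 32);
    int32_t lower = (int32_t)(uint32_t)data_2_32;
    result |= (uint64_t)pack32_component(upper, scale) << 32;
    result |= (uint64_t)pack32_component(lower, scale);
    return result;
}
```

For instance, with a scale factor of 7 the binary point of the input lies to the left of bit 15, so an input of 200 * 65536 packs to 200 and an input of 300 * 65536 is clamped to 255.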


Note: The scale factor field of the GSR is 4 bits in the UltraSPARC I/II and 5 bits
in the UltraSPARC III.


Example 
The following code example takes four variables red, green, blue, and
alpha, each containing data for two pixels in a 32-bit partitioned format
(r0r1, g0g1, b0b1, a0a1), and produces a vis_d64 pixels value containing
eight 8-bit quantities (r0g0b0a0r1g1b1a1).


vis_d64 red, green, blue, alpha, pixels;
/* red, green, blue, and alpha contain data for 2 pixels */


pixels = vis_fpack32(red, pixels);
pixels = vis_fpack32(green, pixels);
pixels = vis_fpack32(blue, pixels);
pixels = vis_fpack32(alpha, pixels);
/* The result is two sets of red, green, blue and alpha values packed
   in pixels */


Figure 4-19 vis_fpack32() operation


4.7.3 vis_fpackfix()


Function 
Converts two 32-bit partitioned data to two 16-bit partitioned data. 


Syntax 
vis_f32 vis_fpackfix(vis_d64 data_2_32);


Description 


vis_fpackfix() takes two 32-bit fixed components within data_2_32, scales,
and truncates them into two 16-bit signed components. This is
accomplished by shifting each 32-bit component of data_2_32 according to
GSR.scale-factor and then truncating to a 16-bit scaled value starting
between bit 16 and bit 15 of each 32-bit word. Truncation converts the
scaled value to a signed integer (meaning it rounds toward negative
infinity). If the value is less than -32768, then -32768 is returned. If the
value is greater than 32767, then 32767 is returned. Otherwise the scaled
data_2_16 value is returned. Figure 4-20 shows the vis_fpackfix() operation.
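
The per-component behavior described above (scale, truncate toward negative infinity at bit 16, then clip to the signed 16-bit range) can be sketched in portable C. The name packfix_component is hypothetical, and the model assumes that bits shifted beyond bit 31 still take part in the clipping decision.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit component of a fixed-point pack to 16 bits. */
int16_t packfix_component(int32_t value, int scale)
{
    assert(scale >= 0 && scale <= 31);       /* 4-bit or 5-bit scale field */
    /* Multiply instead of shifting so that negative values stay well defined. */
    int64_t shifted = (int64_t)value * ((int64_t)1 << scale);
    int64_t q = shifted / 65536;             /* binary point between bits 16 and 15 */
    if (shifted % 65536 < 0)
        q -= 1;                              /* round toward negative infinity */
    if (q < -32768)
        return -32768;
    if (q > 32767)
        return 32767;
    return (int16_t)q;
}
```

With a scale factor of 0, the model returns the upper 16 bits of the input, so 0x00038000 yields 3 and -1 yields -1. Larger scale factors move the binary point to the right and can trigger clipping.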


Note: The scale factor field of the GSR is 4 bits in the UltraSPARC I/II and 5 bits
in the UltraSPARC III.


Example 


vis_d64 data_2_32;
vis_f32 data_2_16;


data_2_16 = vis_fpackfix(data_2_32);


Figure 4-20 vis_fpackfix() operation


4.7.4 vis_fexpand() 


Function


Converts four unsigned 8-bit elements to four 16-bit fixed elements.


Syntax


vis_d64 vis_fexpand(vis_f32 data_4_8);


Description 


vis_fexpand() converts packed format data. For example, it can convert raw
pixel data to a partitioned format. vis_fexpand() takes four 8-bit unsigned
elements within data_4_8, converts each integer to a 16-bit fixed value by
inserting four zeroes to the right and to the left of each byte, and returns
four 16-bit elements within a 64-bit result. Since the various vis_fmul8x16()
instructions can also perform this function, vis_fexpand() is mainly used
when the first operation to be used on the expanded data is an addition or
a comparison. Figure 4-21 shows the vis_fexpand() operation.
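
Inserting four zeroes on each side of a byte is the same as shifting the byte left by four bits within its 16-bit lane. A minimal scalar sketch in portable C follows; the name model_fexpand is hypothetical, and uint32_t and uint64_t stand in for vis_f32 and vis_d64.

```c
#include <stdint.h>

/* Scalar model of the 8-bit to 16-bit expansion: each byte b becomes the
 * 16-bit pattern 0000 bbbbbbbb 0000. */
uint64_t model_fexpand(uint32_t data_4_8)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint64_t byte = (data_4_8 >> (8 * lane)) & 0xFF;
        result |= (byte << 4) << (16 * lane);
    }
    return result;
}
```

For example, the input 0xFF800100 expands to 0x0FF0080000100000 in this model.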


Figure 4-21 vis_fexpand() operation


Example 


vis_d64 result_4_16;
vis_f32 data_4_8, factor;


result_4_16 = vis_fexpand(data_4_8);
/* Using vis_fmul8x16al to perform the same function */


factor = vis_to_float(0x0100);
result_4_16 = vis_fmul8x16al(data_4_8, factor);


4.7.5 vis_fpmerge()


Function 


Merges two 8-bit partitioned vis_f32 arguments by selecting bytes
alternately from each.


Syntax


vis_d64 vis_fpmerge(vis_f32 pixels1, vis_f32 pixels2);


Description 


vis_fpmerge() interleaves four corresponding 8-bit unsigned values within
pixels1 and pixels2 to produce a 64-bit merged result. Figure 4-22 shows the
operation.
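
The interleaving can be written out as a scalar model in portable C. This is an illustrative sketch; the name model_fpmerge is hypothetical, and uint32_t and uint64_t stand in for vis_f32 and vis_d64.

```c
#include <stdint.h>

/* Scalar model of the byte merge: byte i of pixels1 is followed by byte i
 * of pixels2, starting from the most-significant byte of each argument. */
uint64_t model_fpmerge(uint32_t pixels1, uint32_t pixels2)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t b1 = (pixels1 >> (8 * i)) & 0xFF;
        uint64_t b2 = (pixels2 >> (8 * i)) & 0xFF;
        result |= ((b1 << 8) | b2) << (16 * i);
    }
    return result;
}
```

Merging 0x00112233 with 0xaabbccdd in this model gives 0x00aa11bb22cc33dd.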


Figure 4-22 vis_fpmerge() operation


Example 


vis_u32 pixels1 = 0x00112233;
vis_u32 pixels2 = 0xaabbccdd;
vis_f32 d, e;
vis_d64 mergeresult;


d = vis_to_float(pixels1);
e = vis_to_float(pixels2);
mergeresult = vis_fpmerge(d, e);
/* mergeresult = 0x00aa11bb22cc33dd */


4.7.6 vis_alignaddr(), vis_faligndata()


Function 


Calculate an 8-byte aligned address and extract an arbitrary eight bytes
from two 8-byte aligned addresses.


Syntax


void *vis_alignaddr(void *addr, int offset);
vis_d64 vis_faligndata(vis_d64 data_hi, vis_d64 data_lo);


Description 


vis_alignaddr() and vis_faligndata() are usually used together.
vis_alignaddr() takes an arbitrarily aligned pointer addr and a signed
integer offset, adds them, places the rightmost three bits of the result in the
address offset field of the GSR, and returns the result with the rightmost
three bits set to 0. This return value can then be used as an 8-byte aligned
address for loading or storing a vis_d64 variable. Figure 4-23 shows an
example.


dp = x10000 (aligned boundary), da = x10005 (data start address)

vis_alignaddr(x10005, 0) returns x10000 with five placed in the GSR offset field.

vis_alignaddr(x10005, -2) returns x10000 with three placed in the GSR offset field.


Figure 4-23 vis_alignaddr() example


vis_faligndata() takes two vis_d64 arguments, data_hi and data_lo. It
concatenates these two 64-bit values, with data_hi as the upper half of the
concatenated value and data_lo as the lower half. Bytes in this value are
numbered from most-significant to least-significant, with the
most-significant byte being 0. The return value is a vis_d64 variable
representing eight bytes extracted from the concatenated value, with the
most-significant byte specified by the GSR offset field. Figure 4-24 shows
the return value when the GSR address offset field has the value five.


data_hi is loaded from x10000 and data_lo from x10008; with an offset of
five, vis_faligndata(data_hi, data_lo) returns the eight bytes starting at
x10005.


Figure 4-24 vis_faligndata() example
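
The address split and the byte extraction described above can be sketched as a scalar model in portable C. The names model_alignaddr, model_faligndata, and align_result are hypothetical; the GSR offset field is represented by an ordinary struct member, and uint64_t stands in for vis_d64 with the most-significant byte counted as byte 0.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uintptr_t aligned;   /* sum with the rightmost three bits cleared */
    unsigned  offset;    /* rightmost three bits, as placed in the GSR */
} align_result;

align_result model_alignaddr(uintptr_t addr, intptr_t offset)
{
    uintptr_t sum = addr + (uintptr_t)offset;
    align_result r = { sum & ~(uintptr_t)7, (unsigned)(sum & 7) };
    return r;
}

/* Extract eight bytes from the 16-byte concatenation data_hi:data_lo,
 * starting at byte gsr_offset. */
uint64_t model_faligndata(uint64_t data_hi, uint64_t data_lo, unsigned gsr_offset)
{
    assert(gsr_offset < 8);
    if (gsr_offset == 0)
        return data_hi;              /* avoids an undefined shift by 64 */
    return (data_hi << (8 * gsr_offset)) | (data_lo >> (64 - 8 * gsr_offset));
}
```

With the values from Figure 4-23, model_alignaddr(0x10005, 0) yields the aligned address 0x10000 and offset 5, and an offset argument of -2 yields offset 3. With offset 5, model_faligndata() returns bytes 5 through 12 of the concatenated value.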


Care must be taken not to read past the end of a legal segment of memory.
A legal segment can begin and end only on page boundaries; and so, if any
byte of a vis_d64 lies within a valid page, the entire vis_d64 must lie within
the page. However, when addr is already 8-byte aligned, the GSR address
offset bits are set to 0 and no byte of data_lo is used. Therefore, although it
is legal to read eight bytes starting at addr, it may not be legal to read 16
bytes, and code that always reads both data_hi and data_lo will fail. You
can avoid this problem in one of the following ways:


* addr can be compared with some known address of the last legal byte; 

* The final iteration of a loop, which may require reading past the end of 
the legal data, can be special-cased; 

* Slightly more memory than required can be allocated to ensure that 
valid bytes are available after the end of the data. 


Example 


The following example shows how these instructions can be used together
to read a group of eight bytes from an arbitrarily aligned address addr, as
follows:


void *addr;
vis_d64 *addr_aligned;
vis_d64 data_hi, data_lo, data;


addr_aligned = (vis_d64*) vis_alignaddr(addr, 0);
data_hi = addr_aligned[0];
data_lo = addr_aligned[1];
data = vis_faligndata(data_hi, data_lo);


When data are being accessed in a stream, it is not necessary to perform all
the steps shown above for each vis_d64. Instead, the address may be
aligned once and only one new vis_d64 read per iteration:


addr_aligned = (vis_d64*) vis_alignaddr(addr, 0);
data_hi = addr_aligned[0];
for (i = 0; i < times; ++i) {
    data_lo = addr_aligned[i + 1];
    data = vis_faligndata(data_hi, data_lo);
    /* Use data here. */

    /* Move data "window" to the right. */
    data_hi = data_lo;
}

The same considerations concerning "read ahead" apply here. In general, it
is best not to use vis_alignaddr() to generate an address within an inner
loop, for example:


{
    addr_aligned = vis_alignaddr(addr, offset);
    data_hi = addr_aligned[0];
    offset += 8;
    /* ... */
}


The data cannot be read until the new address has been computed. Instead,
compute the aligned address once, and either increment it directly or use
array notation. This will ensure that the address arithmetic is performed in
the integer units in parallel with the execution of the VIS instructions.


4.7.7 vis_edge[8, 16, 32]()


Function 


Compute a mask used for partial storage at an arbitrarily aligned start or 
stop address. Instructions are typically used to handle boundary conditions 
for parallel pixel scan line loops. 


Syntax 


/* Pure edge handling instructions */
vis_s32 vis_edge8(void *address1, void *address2);
vis_s32 vis_edge16(void *address1, void *address2);
vis_s32 vis_edge32(void *address1, void *address2);

/* Little-endian versions of pure edge handling instructions */
vis_s32 vis_edge8l(void *address1, void *address2);
vis_s32 vis_edge16l(void *address1, void *address2);
vis_s32 vis_edge32l(void *address1, void *address2);

/* Edge handling instructions which do not set the
   integer condition codes */
vis_s32 vis_edge8n(void *address1, void *address2);
vis_s32 vis_edge8ln(void *address1, void *address2);
vis_s32 vis_edge16n(void *address1, void *address2);
vis_s32 vis_edge16ln(void *address1, void *address2);
vis_s32 vis_edge32n(void *address1, void *address2);
vis_s32 vis_edge32ln(void *address1, void *address2);


Description 


vis_edge8(), vis_edge16(), and vis_edge32() compute a mask to identify
which (8-bit, 16-bit, or 32-bit) components of a vis_d64 variable are valid
for writing to an 8-byte aligned address. vis_edge[8, 16, 32]() are typically
used with a partial store instruction. Partial stores always start to write at
an 8-byte aligned address. An application, however, may be designed to
start writing at an arbitrary address that is not 8-byte aligned. This requires
a mask. For example, if you want to start writing data at address 0x10003,
then a partial store instruction (described in the next section) starts
writing at address 0x10000, and the mask [00011111] disables the writes to
0x10000, 0x10001, and 0x10002 and enables the writes to 0x10003,
0x10004, 0x10005, 0x10006, and 0x10007.


vis_edge[8,16,32]() accepts two addresses (address1 and address2), where
address1 is the address of the next pixel to write, and address2 is the address
of the last pixel in the scanline. These instructions compute two masks: a
left edge mask and a right edge mask. The left edge mask is computed
from the three least-significant bits (LSBs) of address1. The right edge mask
is computed from the three LSBs of address2, according to Table 4-2 or, for
little-endian byte ordering, Table 4-3.

vis_edge[8,16,32][l]n() are the new edge instructions introduced with VIS
2.0. They have the same functionality as the original edge instructions but
do not set the integer condition codes, allowing them to be grouped with
other instructions.


Note: For VIS 2.0 and later, vis_edge[8,16,32][l]() are redefined as
aliases of vis_edge[8,16,32][l]n(), so that users can use the same source
code but take advantage of the new edge instructions.


Table 4-2 Edge Mask Specification


Table 4-3 Edge Mask Specification (Little-endian)


The edge instructions then zero out the three least-significant bits of
address1 and address2 to get 8-byte aligned addresses, meaning
address1 & (~7) and address2 & (~7). If the aligned addresses differ, then
the left edge mask is returned; if they are the same, then the result of the
bitwise ANDing of the left and right edge masks is returned. Note that if
the aligned addresses differ and address1 is greater than address2, then the
edge instructions still return the left edge mask, which in almost all cases
is not desirable. When the aligned addresses differ, it is best to keep
address1 less than or equal to address2.
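
The selection rule above can be sketched for the 8-bit, big-endian case in portable C. The name model_edge8 is hypothetical. The left edge formula follows the 0x10003 example given earlier (mask 00011111); the right edge formula is an assumption chosen to be symmetric with it (bytes up to and including the one addressed by address2 are enabled), since the table contents are not reproduced in this text.

```c
#include <stdint.h>

/* Scalar model of the 8-bit big-endian edge mask.  Bit 7 of the mask
 * corresponds to the byte at the aligned address, bit 0 to the byte at
 * aligned address + 7. */
unsigned model_edge8(uintptr_t address1, uintptr_t address2)
{
    unsigned left  = 0xFFu >> (address1 & 7);
    unsigned right = (0xFFu << (7 - (address2 & 7))) & 0xFFu;   /* assumed form */

    if ((address1 & ~(uintptr_t)7) != (address2 & ~(uintptr_t)7))
        return left;            /* aligned addresses differ: left edge only */
    return left & right;        /* same 8-byte block: both edges apply */
}
```

For a span that starts at 0x10003 and ends in a later 8-byte block the model returns 0x1F; for a span from 0x10003 to 0x10005 within one block it returns 0x1C, enabling only bytes 3, 4, and 5.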


The little-endian versions vis_edge[8l, 16l, 32l]() compute a mask that is bit
reversed from the big-endian version.


The following examples show the handling of data boundaries by two
functions, vis_inverse8a() and vis_inverse8b(), that lead to identical
results but differ in the way that they handle the starting point.


vis_inverse8b() never accesses data beyond the 8-byte aligned start
address. Such access occurs with vis_inverse8a() when the offset in the
destination address alignment is larger than the offset in the source
address alignment. vis_inverse8b() uses one additional
vis_alignaddr()/vis_faligndata() pair to deal with the offset of address
alignment in the destination. This is a "safer" approach than
vis_inverse8a().

Figure 4-25 shows start point handling by the function vis_inverse8a() and
Figure 4-26 shows start point handling by the function vis_inverse8b().





Figure 4-25 Start Point Handling in vis_inverse8a() (emask = 00111111)


Figure 4-26 Start Point Handling in vis_inverse8b() (emask = 00111111)
Examples


/*
 * FUNCTION
 *     vis_inverse8a(), vis_inverse8b() invert an array of 8-bit data
 *
 * SYNOPSIS
 *     void vis_inverse8a (vis_u8 *src, vis_u8 *dst, int num);
 *     void vis_inverse8b (vis_u8 *src, vis_u8 *dst, int num);
 *
 * ARGUMENT
 *     src    pointer to first byte of source data
 *     dst    pointer to first byte of destination data
 *     num    length of arrays
 *
 * DESCRIPTION
 *     dst[i] = 255 - src[i], 0 <= i < num
 */

#include <stdlib.h>
#include "vis_types.h"
#include "vis_proto.h"


Code Example 4-1 Data Boundary Handling By vis_inverse8a()


void vis_inverse8a (vis_u8 *src, vis_u8 *dst, int length)
{
    vis_u8  *sa = src;      /* start point in source */
    vis_d64 *sp;            /* 8-byte aligned start point in source */
    vis_u8  *da = dst;      /* start point in destination */
    vis_u8  *dend, *dend2;  /* end point in destination */
    vis_d64 *dp;            /* 8-byte aligned start point in destination */
    int     off;            /* offset of address alignment in destination */
    int     emask;          /* edge mask */
    vis_d64 s, s1, s0;      /* source data */
    vis_d64 d;              /* destination data */

    /* prepare destination address */
    dp = (vis_d64 *) ((vis_addr) da & (~7));
    off = (vis_addr) dp - (vis_addr) da;
    dend = da + length - 1; /* pointer to the last byte of data. */
    dend2 = dend - 8;       /* pointer to the last byte which */
                            /* doesn't need edge handling. */

    /* generate edge mask for start point */
    emask = vis_edge8(da, dend);

    /* prepare source address and set GSR alignaddr offset */
    sp = (vis_d64 *) vis_alignaddr(sa, off);

    /* load 8 bytes of source data */
    s0 = *sp;
    sp ++;
    s1 = *sp;
    s = vis_faligndata(s0, s1);

    /* 8-pixel inversion */
    d = vis_fnot(s);

    /* store 8 bytes of result */
    vis_pst_8(d, dp, emask);

    /* advance to the next 8 bytes */
    s0 = s1;
    sp ++;
    dp ++;

    /* set edge mask to 11111111, so all 8 bytes of data */
    /* will be saved in vis_pst_8() during the while-loop. */
    emask = 0xff;

    /* 8-byte loop */
    while ((vis_u32) dp <= (vis_u32) dend2) {

        /* load 8 bytes of source data */
        s1 = *sp;
        s = vis_faligndata(s0, s1);

        /* 8-pixel inversion */
        d = vis_fnot(s);

        /* store 8 bytes of result */
        vis_pst_8(d, dp, emask);

        /* advance to the next 8 bytes */
        s0 = s1;
        sp ++;
        dp ++;
    }

    /* generate edge mask for end point */
    emask = vis_edge8(dp, dend);

    /* load 8 bytes of source data */
    s1 = *sp;
    s = vis_faligndata(s0, s1);

    /* 8-pixel inversion */
    d = vis_fnot(s);

    /* store 8 bytes of result */
    vis_pst_8(d, dp, emask);
}
Code Example 4-2 Data Boundary Handling by vis_inverse8b() 


void vis_inverse8b (vis_u8 *src, vis_u8 *dst, int length)
{
    vis_u8  *sa = src;      /* start point in source */
    vis_d64 *sp;            /* 8-byte aligned start point in source */
    vis_u8  *da = dst;      /* start point in destination */
    vis_u8  *dend, *dend2;  /* end point in destination */
    vis_d64 *dp;            /* 8-byte aligned start point in destination */
    int     off;            /* offset of address alignment in destination */
    int     emask;          /* edge mask */
    vis_d64 s, s1, s0;      /* source data */
    vis_d64 d;              /* destination data */

    /* prepare destination address */
    dp = (vis_d64 *) ((vis_addr) da & (~7));
    off = 8 - ((vis_addr) da & 7);
    dend = da + length - 1; /* pointer to the last byte of data. */
    dend2 = dend - 8;       /* pointer to the last byte which */
                            /* doesn't need edge handling. */

    /* generate edge mask for start point */
    emask = vis_edge8(da, dend);

    /* prepare source address and set GSR alignaddr offset */
    sp = (vis_d64 *) vis_alignaddr(sa, 0);

    /* load 8 bytes of source data */
    s0 = *sp;
    sp ++;
    s1 = *sp;
    s = vis_faligndata(s0, s1);

    /* 8-pixel inversion */
    d = vis_fnot(s);

    /* store 8 bytes of result */
    vis_alignaddr((void *) off, 0);
    vis_pst_8(vis_faligndata(d, d), dp, emask);

    s0 = s1;
    sa += off;
    dp ++;

    /* prepare source address and set GSR alignaddr offset */
    sp = (vis_d64 *) vis_alignaddr(sa, 0);

    /* set edge mask to 11111111, so all 8 bytes of data */
    /* will be saved in vis_pst_8() during the while-loop. */
    emask = 0xff;

    /* 8-byte loop */
    while ((vis_u32) dp <= (vis_u32) dend2) {

        /* load 8 bytes of source data */
        s1 = *sp;
        s = vis_faligndata(s0, s1);

        /* 8-pixel inversion */
        d = vis_fnot(s);

        /* store 8 bytes of result */
        vis_pst_8(d, dp, emask);

        s0 = s1;
        sp ++;
        dp ++;
    }

    /* generate edge mask for end point */
    emask = vis_edge8(dp, dend);

    /* load 8 bytes of source data */
    s1 = *sp;
    s = vis_faligndata(s0, s1);

    /* 8-pixel inversion */
    d = vis_fnot(s);

    /* store 8 bytes of result */
    vis_pst_8(d, dp, emask);
}


4.8 Load and Store 


4.8.1 Partial Store Instructions 


Function 
Write mask-enabled 8-bit, 16-bit, and 32-bit components from a vis_d64
value to memory.


Syntax 

void vis_pst_8(vis_d64 data, void *address, vis_u8 mask);
void vis_pst_16(vis_d64 data, void *address, vis_u8 mask);
void vis_pst_32(vis_d64 data, void *address, vis_u8 mask);


Description 
vis_pst_[8, 16, 32]() use mask, typically determined by edge or compare
instructions, to control which 8-bit, 16-bit, or 32-bit components of data are
written to memory. Typical uses include writing only selected
channels of a multi-channel image, avoiding writes past image
boundaries, and selecting between images on a pixel-by-pixel basis based
on the result of a comparison instruction.
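
The masking behavior described above can be modeled in portable C. This is a minimal sketch, not the VIS implementation: pst8_model is a hypothetical helper name, and the sketch assumes the big-endian convention in which the most significant mask bit controls the byte at the lowest address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vis_pst_8(): byte i of data (byte 0 is the most
 * significant byte) is stored to address[i] only if mask bit (7 - i) is set. */
static void pst8_model(uint64_t data, uint8_t *address, uint8_t mask)
{
    for (size_t i = 0; i < 8; ++i) {
        if (mask & (0x80u >> i)) {
            address[i] = (uint8_t)(data >> (56 - 8 * i));
        }
    }
}
```

With a mask of 0x1F, for example, only the last five bytes of the 8-byte destination are overwritten; the first three keep their previous contents.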


Example 


Code Example 4-3 Creation of Mask That Allows for an Unaligned Store 


vis_d64 *addr, *addr_last, *addr_aligned;
vis_d64 data;
int emask;

emask = vis_edge8(addr, addr_last);
addr_aligned = vis_alignaddr(addr, 0);
vis_pst_8(data, addr_aligned, emask);


Code Example 4-4 Loop that Writes Zeroes to a Span of Bytes 


vis_d64 *addr, *addr_last, *addr_aligned;
vis_d64 zero;
int emask;

zero = vis_fzero();
addr_aligned = vis_alignaddr(addr, 0);
emask = vis_edge8(addr, addr_last);
while ((vis_addr) addr_aligned <= (vis_addr) addr_last) {
    vis_pst_8(zero, addr_aligned, emask);
    addr_aligned ++;
    emask = vis_edge8(addr_aligned, addr_last);
}


Code Example 4-5 Same Function as the Loop in Code Example 4-4 Except Using an Explicit Loop Counter

vis_d64 *addr, *addr_last, *addr_aligned;
vis_d64 zero;
int i, emask, times;

zero = vis_fzero();
addr_aligned = vis_alignaddr(addr, 0);
emask = vis_edge8(addr, addr_last);
times = ((vis_addr) addr_last >> 3) - ((vis_addr) addr >> 3) + 1;
for (i = 0; i < times; i ++) {
    vis_pst_8(zero, addr_aligned, emask);
    addr_aligned ++;
    emask = vis_edge8(addr_aligned, addr_last);
}
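
The way the edge mask and the partial store cooperate in the zeroing loops can be checked with a portable model. The sketch below is an illustration under stated assumptions, not the VIS implementation: edge8_model and zero_span_model are hypothetical names, addresses are modeled as byte offsets into a buffer whose start is taken to be 8-byte aligned, and the mask uses the big-endian convention (most significant bit for the lowest address).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vis_edge8(): left-edge mask from the low three bits
 * of a1, combined with a right-edge mask when a1 and a2 fall in the same
 * 8-byte block. */
static uint8_t edge8_model(size_t a1, size_t a2)
{
    uint8_t mask = (uint8_t)(0xFFu >> (a1 & 7));
    if ((a1 >> 3) == (a2 >> 3)) {
        mask &= (uint8_t)(0xFFu << (7 - (a2 & 7)));
    }
    return mask;
}

/* Zero bytes first..last (inclusive) of buf, one 8-byte block at a time,
 * writing only the bytes enabled by the edge mask. */
static void zero_span_model(uint8_t *buf, size_t first, size_t last)
{
    size_t block = first & ~(size_t)7;
    uint8_t emask = edge8_model(first, last);

    while (block <= last) {
        for (size_t i = 0; i < 8; ++i) {
            if (emask & (0x80u >> i)) {
                buf[block + i] = 0;
            }
        }
        block += 8;
        emask = edge8_model(block, last);
    }
}
```

Only the first and last blocks get a partial mask; every block in between gets 0xFF, which is why the interior of a long span costs one full store per 8 bytes.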


Note: If your system has memory-mapped devices and you use the partial store
instructions vis_pst_[8,16,32]() (see Section 4.8.1, “Partial Store
Instructions,” on page 75) to store data to memory locations into which a
device is mapped, the operation works only if the device is “cached”. The
partial store is a read-modify-write operation and does not work for
“non-cached” memory-mapped devices. For example, it does not work across the
S-Bus.


4.8.2 Byte/Short Loads and Store Instructions


Function 


Perform 8-bit and 16-bit loads and stores to and from floating-point 
registers. 


Syntax 
/* Short stores */
void vis_st_u8(vis_d64 data, void *address);
void vis_st_u8_i(vis_d64 data, void *address, long index);
void vis_st_u16(vis_d64 data, void *address);
void vis_st_u16_i(vis_d64 data, void *address, long index);
void vis_st_u8_le(vis_d64 data, void *address);
void vis_st_u16_le(vis_d64 data, void *address);

/* Short loads */
vis_d64 vis_ld_u8(void *address);
vis_d64 vis_ld_u8_i(void *address, long index);
vis_d64 vis_ld_u16(void *address);
vis_d64 vis_ld_u16_i(void *address, long index);
vis_d64 vis_ld_u8_le(void *address);
vis_d64 vis_ld_u16_le(void *address);


Description 


vis_ld_u[8, 8_i, 16, 16_i]() and vis_st_u[8, 8_i, 16, 16_i]() perform 8-bit and
16-bit loads or stores to and from 64-bit variables. Bytes and shorts may be
loaded to and stored from the floating-point register file. Bytes may be
loaded from and stored to arbitrary addresses, and shorts from/to even
addresses. Instructions with the _i suffix add index to address just prior to
loading from or storing to memory. vis_ld_u[8_le, 16_le]() and
vis_st_u[8_le, 16_le]() perform the same function, but use the little-endian
addressing convention.


A common trick uses vis_faligndata() and vis_[ld, st]_u8() to read a series
of noncontiguous bytes, accumulate them into a vis_d64, and store them all
at once. This trick can almost double the speed of some memory-bound
loops.


Example 


vis_u8 *addr0, *addr1, *addr2, *addr3;
vis_u8 *addr4, *addr5, *addr6, *addr7;
vis_d64 val0, val1, val2, val3, val4, val5, val6, val7, accum;
vis_d64 *output;

vis_alignaddr((void *) 0, 7);
accum = vis_fzero();
for (;;) {
    /* Generate addr0, ..., addr7 somehow. */
    val0 = vis_ld_u8(addr0);
    val1 = vis_ld_u8(addr1);
    val2 = vis_ld_u8(addr2);
    val3 = vis_ld_u8(addr3);
    val4 = vis_ld_u8(addr4);
    val5 = vis_ld_u8(addr5);
    val6 = vis_ld_u8(addr6);
    val7 = vis_ld_u8(addr7);
    accum = vis_faligndata(val7, accum);
    accum = vis_faligndata(val6, accum);
    accum = vis_faligndata(val5, accum);
    accum = vis_faligndata(val4, accum);
    accum = vis_faligndata(val3, accum);
    accum = vis_faligndata(val2, accum);
    accum = vis_faligndata(val1, accum);
    accum = vis_faligndata(val0, accum);
    *output++ = accum;
}
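
Why eight vis_faligndata() calls with an alignment offset of 7 pack eight bytes into one 64-bit value can be seen in a portable model. The sketch below is an assumption-laden illustration: faligndata_model and gather8_model are hypothetical names, the byte loaded by vis_ld_u8() is assumed to sit in the least significant byte of the register, and faligndata is modeled as extracting 8 bytes, starting at the alignment offset, from the 16-byte concatenation of its two operands.

```c
#include <stdint.h>

/* Hypothetical model of vis_faligndata() for an alignment offset of 0..7. */
static uint64_t faligndata_model(uint64_t hi, uint64_t lo, unsigned align)
{
    if (align == 0) {
        return hi;
    }
    return (hi << (8 * align)) | (lo >> (64 - 8 * align));
}

/* Gather eight bytes from arbitrary addresses into one 64-bit value.
 * Each step shifts the accumulator right by one byte and inserts the new
 * byte at the top, so addr[0] ends up in the most significant byte. */
static uint64_t gather8_model(const uint8_t *const addr[8])
{
    uint64_t accum = 0;
    for (int k = 7; k >= 0; --k) {
        accum = faligndata_model((uint64_t)*addr[k], accum, 7);
    }
    return accum;
}
```

Processing val7 first and val0 last, as the example does, is what places the bytes in memory order in the final 8-byte store.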


4.8.3 Block Load and Store Instructions 


Function 


Transfer 64 bytes of data between memory and registers. 


Syntax 


The Block Load and Store instructions do not have a C interface and must
be coded in assembly language. For assembly language syntax, refer to
Section 13.6.4, “Block Load and Store Instructions,” in the UltraSPARC User’s
Manual.


Description 


The block load instruction loads 64 bytes of data, with a block transfer, 
from a 64-byte aligned memory area into eight double-precision floating- 
point registers. 


The block store instruction stores data, with a block transfer, from eight
double-precision floating-point registers to a 64-byte aligned memory area.


Example


Note that the loop must be unrolled to achieve maximum performance. All
FP registers are double-precision. Eight versions of this loop are needed to
handle all the cases of double word misalignment between the source and
destination.

loop:
    faligndata %d0, %d2, %d34
    faligndata %d2, %d4, %d36
    faligndata %d4, %d6, %d38
    faligndata %d6, %d8, %d40
    faligndata %d8, %d10, %d42
    faligndata %d10, %d12, %d44
    faligndata %d12, %d14, %d46
    addcc l0, -1, l0
    bg,pt l1
    fmovd %d14, %d48
    (end of loop handling)

l1: ldda [regaddr] ASI_BLK_P, %d0
    stda %d32, [regaddr] ASI_BLK_P
    faligndata %d48, %d16, %d32
    faligndata %d16, %d18, %d34
    faligndata %d18, %d20, %d36
    faligndata %d20, %d22, %d38
    faligndata %d22, %d24, %d40
    faligndata %d24, %d26, %d42
    faligndata %d26, %d28, %d44
    faligndata %d28, %d30, %d46
    addcc l0, -1, l0
    be,pnt done
    fmovd %d30, %d48
    ldda [regaddr] ASI_BLK_P, %d16
    stda %d32, [regaddr] ASI_BLK_P
    ba loop
    faligndata %d48, %d0, %d32

done: (end of loop processing)


See also Section 5.2.8, “Using VIS Block Load and Store Instructions,” on
page 95.


4.9 Array Instructions 


vis_array[8, 16, 32]()


Function


Translate fixed-point (x,y,z) coordinates into a memory address in a data
set formatted in a blocked fashion.


Syntax


vis_addr vis_array8(vis_u64 data1, vis_s32 data2);
vis_addr vis_array16(vis_u64 data1, vis_s32 data2);
vis_addr vis_array32(vis_u64 data1, vis_s32 data2);


Description


The array instructions facilitate 3D texture mapping and volume rendering
by computing a memory address for data lookup based on fixed-point x, y,
and z coordinates. The data are laid out in a blocked fashion, so that points
which are near one another have their data stored in nearby memory
locations.


If the texture data were laid out in the obvious fashion (the z = 0 plane,
followed by the z = 1 plane, and so on), then even small changes in z
would result in references to distant pages in memory. The resulting lack of
locality would tend to result in TLB misses and poor performance. The
three versions of the array instruction, vis_array8(), vis_array16(), and
vis_array32(), differ only in the scaling of the computed memory offsets.
vis_array16() shifts its result left by one position and vis_array32() shifts
left by two in order to handle 16-bit and 32-bit texture data.


When using the array instructions, a “blocked-byte” data formatting
structure is imposed. The N x N x M volume, where N = 2^n x 64, M = m x
32, 0 <= n <= 5, 1 <= m <= 16, should be composed of 64 x 64 x 32 smaller
volumes, which in turn should be composed of 4 x 4 x 2 volumes. This
data structure is optimal for 16-bit data. For 16-bit data, the 4 x 4 x 2
volume has 64 bytes of data, which is ideal for reducing cache-line misses;
the 64 x 64 x 32 volume will have 256K bytes of data, which is good for
improving the TLB hit rate. Figure 4-27 shows how the data has to be
organized, where the origin (0,0,0) is assumed to be at the lower left front
corner and the x coordinate varies faster than y, which varies faster than z.
In other words, when traversing the volume from the origin to the upper-right
back, go from left to right, front to back, and bottom to top.


[Figure: an N x N x M volume (N = 2^n x 64, M = m x 32) built from
64 x 64 x 32 blocks; each block spans 16 x 4 = 64 elements in x and y and
16 x 2 = 32 elements in z.]

Figure 4-27 Blocked-Byte Data Formatting Structure


The array instructions have two inputs: 


1. The (x,y,z) coordinates are input via a single 64-bit integer organized as
shown in Figure 4-28.


Figure 4-28 3D Array Fixed-Point Address Format


Note that z has only nine integer bits as opposed to 11 for x and y. Also note that
since (x,y,z) are all contained in one 64-bit register, they can be incremented
simultaneously by using a 64-bit addition/subtraction, thus providing a significant
performance boost.


2. The X, Y size of the N x N x M volume. Use the following table for the size
specification:


Size    Number of Elements
0       64
1       128
2       256
3       512
4       1024
5       2048


So for a 512 x 512 x 32 or a 512 x 512 x 256 volume, you will input a size value of
three. Note that the X and Y sizes of the volume have to be the same. The Z size of
the volume is a multiple of 32 ranging between 32 and 512.


The array instructions output an integer memory offset that, when added to the
base address of the volume, gives you the address of the voxel and can be used
by a load instruction. The offset is correct only if the data has been reformatted
as specified above. The output is formatted as shown in Figure 4-29 for array8,
Figure 4-30 for array16, and Figure 4-31 for array32.


[Figure: upper, middle, and lower fields; field boundaries at bits 20+2n,
17+2n, and 17+n.]

Figure 4-29 3D Array Blocked Address Format (array8)


[Figure: upper, middle, and lower fields; field boundaries at bits 21+2n,
18+2n, and 18+n.]

Figure 4-30 3D Array Blocked Address Format (array16)


[Figure: upper, middle, and lower fields; field boundaries at bits 22+2n,
19+2n, and 19+n.]

Figure 4-31 3D Array Blocked-Address Format (array32)
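
The blocked offset computation can be sketched in portable C. Everything in this sketch is an assumption to be checked against Figure 4-28 through Figure 4-31: it assumes the integer parts of x, y, and z occupy bits 21:11, 43:33, and 63:55 of the input, and that the output concatenates a lower field (x<1:0>, y<1:0>, z<0>), a middle field (x<5:2>, y<5:2>, z<4:1>), and an upper field (n bits of x, n bits of y, z<8:5>). The names array8_model, array16_model, and pack_xyz_model are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a fixed-point coordinate word from integer x, y, z (fractions = 0). */
static uint64_t pack_xyz_model(uint64_t x, uint64_t y, uint64_t z)
{
    return ((z & 0x1FF) << 55) | ((y & 0x7FF) << 33) | ((x & 0x7FF) << 11);
}

/* Hypothetical model of vis_array8(); n is the size value (0..5). */
static uint64_t array8_model(uint64_t fixed, unsigned n)
{
    assert(n <= 5);
    uint64_t x = (fixed >> 11) & 0x7FF;   /* 11 integer bits */
    uint64_t y = (fixed >> 33) & 0x7FF;   /* 11 integer bits */
    uint64_t z = (fixed >> 55) & 0x1FF;   /*  9 integer bits */
    uint64_t upper_mask = ((uint64_t)1 << n) - 1;
    uint64_t off = 0;

    off |= (x & 3);                        /* lower: 4 x 4 x 2 block   */
    off |= (y & 3) << 2;
    off |= (z & 1) << 4;
    off |= ((x >> 2) & 0xF) << 5;          /* middle: 64 x 64 x 32     */
    off |= ((y >> 2) & 0xF) << 9;
    off |= ((z >> 1) & 0xF) << 13;
    off |= ((x >> 6) & upper_mask) << 17;  /* upper: which large block */
    off |= ((y >> 6) & upper_mask) << (17 + n);
    off |= ((z >> 5) & 0xF) << (17 + 2 * n);
    return off;
}

/* array16 scales the same offset for 2-byte elements. */
static uint64_t array16_model(uint64_t fixed, unsigned n)
{
    return array8_model(fixed, n) << 1;
}
```

Under these assumptions, neighbors in x, y, or z within a 4 x 4 x 2 block differ only in the low five offset bits, which is the locality property the blocked layout is designed to provide.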


See the example in 5.2.9, “Using array8 With Assembly Code,” on page 100, to
see how the array8, the load and the add/sub instructions are used and grouped
together for maximum throughput. The grouping takes into consideration the
latencies of the different instructions. In other words, the load, ldda, following the
array8, does not load the voxel just addressed by the array8 in its grouping, but
rather the voxel addressed by array8 in the previous grouping.


The array instructions operate on all 64 bits of an integer register. Solaris 2.5
allows all 64 bits of the registers %g2-%g4 and %o0-%o7 to be used; other registers
cannot be relied on to retain their upper 32 bits. Since the current SPARCompiler
4.x has limited support for 64-bit integer operations, the array instructions might
not be accessed efficiently from C. For a coding example, see 5.2.9, “Using array8
With Assembly Code,” on page 100.


Note: In the 32-bit mode, vis_u64 is the same as unsigned long long,
which means vis_array[8,16,32]() does not strictly conform to the ANSI/ISO C
standard.


4.10 Pixel Distance Instructions: vis_pdist()


Function 


Compute the absolute values of the differences between eight pairs of
vis_u8 components and add them to an accumulator.


Syntax 


vis_d64 vis_pdist(vis_d64 pixels1, vis_d64 pixels2, vis_d64 accumulator);


Description 


vis_pdist() takes three double-precision arguments: pixels1, pixels2, and
accumulator. pixels1 and pixels2 contain eight pixels each in raw format. The
pixels are subtracted from one another, pairwise, and the absolute values
of the differences are accumulated into accumulator. Note that the destination
register is a double-precision floating-point register, which contains an
integral value.


To use vis_pdist() from C, it is necessary for the accumulating register
accumulator to appear both as an argument and as the receiver of the return
value.


The vis_pdist() instruction is intended to accelerate motion compensation
to support real-time video compression in such applications as H.320 video
conferencing.


Example 


vis_d64 accum, pixels1, pixels2;

accum = vis_fzero();
accum = vis_pdist(pixels1, pixels2, accum);
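
The accumulation described above amounts to a sum of absolute differences over eight byte lanes. The following portable sketch models that arithmetic only; pdist_model is a hypothetical name and the function is not the VIS intrinsic.

```c
#include <stdint.h>

/* Hypothetical model of vis_pdist(): add the absolute differences of the
 * eight unsigned byte pairs in p1 and p2 to accum and return the result. */
static uint64_t pdist_model(uint64_t p1, uint64_t p2, uint64_t accum)
{
    for (int i = 0; i < 8; ++i) {
        int a = (int)((p1 >> (8 * i)) & 0xFF);
        int b = (int)((p2 >> (8 * i)) & 0xFF);
        accum += (uint64_t)(a > b ? a - b : b - a);
    }
    return accum;
}
```

As with the intrinsic, the accumulator is both an input and the result, so consecutive calls over a block of pixels build up a single distance value.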


84 VIS Instruction Set User's Manual * May, 2001 


4.11 Byte Mask and Shuffle Instructions:
vis_read_bmask(), vis_write_bmask(), vis_bshuffle()


Function 


Read/write the GSR.mask field and extract 8 bytes from 16 bytes based on 
the value of GSR.mask. 


Syntax 


vis_u32 vis_read_bmask();
void vis_write_bmask(vis_u32 mask1, vis_u32 mask2);
vis_d64 vis_bshuffle(vis_d64 pixels1, vis_d64 pixels2);


Description


vis_read_bmask() returns GSR.mask.


vis_write_bmask() adds two unsigned integer variables, mask1 and
mask2, and stores the least significant 32 bits of the result in the GSR.mask
field.


vis_bshuffle() concatenates the two 64-bit floating-point variables specified
by pixels1 (more significant half) and pixels2 (less significant half) to form
a 16-byte value. Bytes in the concatenated value are numbered from most
significant to least significant, with the most significant byte being byte 0.
vis_bshuffle() extracts 8 of those 16 bytes and stores the result in the 64-bit
floating-point variable. Bytes in result are also numbered from most to
least significant, with the most significant being byte 0. The following table
indicates which source byte is extracted from the concatenated value for
each byte in result.


Destination Byte (in result)    Source Byte

0 (most significant)            (pixels1 || pixels2)[GSR.mask<31:28>]
1                               (pixels1 || pixels2)[GSR.mask<27:24>]
2                               (pixels1 || pixels2)[GSR.mask<23:20>]
3                               (pixels1 || pixels2)[GSR.mask<19:16>]
4                               (pixels1 || pixels2)[GSR.mask<15:12>]
5                               (pixels1 || pixels2)[GSR.mask<11:8>]
6                               (pixels1 || pixels2)[GSR.mask<7:4>]
7 (least significant)           (pixels1 || pixels2)[GSR.mask<3:0>]























Example


vis_d64 sd1, sd2, dd;
unsigned int bmask;

bmask = 0xB89A7456;
vis_write_bmask(0, bmask);

bmask = vis_read_bmask();

sd1 = vis_to_double(0x01234567, 0x89abcdef);
sd2 = vis_to_double(0xfedcba98, 0x76543210);

dd = vis_bshuffle(sd1, sd2);


[Figure: bytes 0-7 of sd1 and bytes 8-15 of sd2 are selected by the eight
nibbles of bmask (0xB89A7456) to form the eight bytes of dd.]

Figure 4-32 vis_bshuffle() operation


Note: These new instructions are only available with VIS 2.0 or later.
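
The selection rule in the table above can be checked with a portable model. The sketch below is illustrative only; bshuffle_model is a hypothetical name, and the mask is passed as an argument instead of being read from GSR.mask.

```c
#include <stdint.h>

/* Hypothetical model of vis_bshuffle(): each 4-bit field of bmask, taken
 * from most to least significant, selects one of the 16 bytes of the
 * concatenation pixels1 || pixels2 (byte 0 is the most significant byte
 * of pixels1, byte 15 the least significant byte of pixels2). */
static uint64_t bshuffle_model(uint64_t pixels1, uint64_t pixels2,
                               uint32_t bmask)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned sel = (bmask >> (28 - 4 * i)) & 0xF;
        uint64_t src = (sel < 8) ? pixels1 : pixels2;
        uint8_t byte = (uint8_t)(src >> (56 - 8 * (sel & 7)));
        result = (result << 8) | byte;
    }
    return result;
}
```

With the operands from the example (sd1 = 0x0123456789abcdef, sd2 = 0xfedcba9876543210, bmask = 0xB89A7456), this model yields 0x98fedcbaef89abcd.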


























Code Examples 5 


5.1 Chapter Overview 


This chapter describes sample programs that show the use of the VIS instruction 
set. It shows examples from the following major application areas: 


* Imaging
* Graphics
* Audio
* Video


5.2 Simple Examples 


The following are some code examples illustrating the application of the VIS
instruction set.


5.2.1 Averaging Two Images 


void
ave(vis_d64 inputs0[], vis_d64 inputs1[],
    vis_d64 outputs[], int times)
{
    int i;
    vis_d64 input0, input1;
    vis_d64 result_hi, result_lo;

    vis_write_gsr(2 << 3); /* Set shift field of gsr to 2. */
    for (i = 0; i < times; ++i) {
        input0 = inputs0[i];
        input1 = inputs1[i];
        result_hi = vis_fpadd16(vis_fexpand(vis_read_hi(input0)),
                                vis_fexpand(vis_read_hi(input1)));
        result_lo = vis_fpadd16(vis_fexpand(vis_read_lo(input0)),
                                vis_fexpand(vis_read_lo(input1)));
        outputs[i] = vis_freg_pair(vis_fpack16(result_hi),
                                   vis_fpack16(result_lo));
    }
}
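
The fixed-point arithmetic behind this example can be traced for a single pixel lane in portable C. The sketch below rests on assumptions about the instruction semantics that should be checked against Chapter 4: vis_fexpand() is modeled as a left shift by 4, and vis_fpack16() as a left shift by the GSR scale factor followed by truncation of the low 7 bits and clamping to 0..255. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical one-lane model of vis_fexpand(): 8-bit pixel to 16-bit. */
static int16_t fexpand_lane(uint8_t pixel)
{
    return (int16_t)(pixel << 4);
}

/* Hypothetical one-lane model of vis_fpack16() for a given scale factor. */
static uint8_t fpack16_lane(int16_t value, unsigned scale)
{
    int32_t shifted = (int32_t)value * (int32_t)(1u << scale);
    if (shifted < 0) {
        return 0;                       /* clamp negative values to 0 */
    }
    int32_t q = shifted >> 7;           /* drop the 7 fraction bits   */
    return (uint8_t)(q > 255 ? 255 : q);
}

/* Average of two pixels as computed by the loop body, with scale = 2. */
static uint8_t average_lane(uint8_t a, uint8_t b)
{
    int16_t sum = (int16_t)(fexpand_lane(a) + fexpand_lane(b));
    return fpack16_lane(sum, 2);
}
```

Under this model the expand step multiplies by 16 and the pack step divides by 32, so the pair of operations leaves (a + b) / 2, truncated toward zero.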


5.2.2 Blending Two Images by a Fixed Percentage 


void
blend(vis_d64 inputs0[], vis_d64 inputs1[], vis_d64 outputs[],
      int percent, int times)
{
    vis_u32 coeff_hi, coeff_lo;
    vis_f32 coefficients;
    vis_d64 input0, input1, blend0, blend1;
    vis_f32 result_hi, result_lo;
    int i;


    vis_write_gsr(1 << 3); /* Set shift field of gsr to 1. */


    coeff_hi = (int) (16384.0 * (percent / 100.0));
    coeff_lo = 16384 - coeff_hi;

    coefficients = vis_to_float((coeff_hi << 16) | coeff_lo);
    for (i = 0; i < times; ++i) {
        input0 = inputs0[i];
        input1 = inputs1[i];
        blend0 = vis_fmul8x16au(vis_read_hi(input0), coefficients);
        blend1 = vis_fmul8x16al(vis_read_hi(input1), coefficients);
        result_hi = vis_fpack16(vis_fpadd16(blend0, blend1));

        blend0 = vis_fmul8x16au(vis_read_lo(input0), coefficients);
        blend1 = vis_fmul8x16al(vis_read_lo(input1), coefficients);
        result_lo = vis_fpack16(vis_fpadd16(blend0, blend1));

        outputs[i] = vis_freg_pair(result_hi, result_lo);
    }
}


5.2.3 Partitioned Arithmetic and Packing


void
interpolate(vis_f32 values[], vis_d64 outputs[], int times)
{
    vis_f32 pixels0, pixels1;
    vis_f32 filters;
    vis_d64 filt00, filt01, filt10, filt11;
    vis_f32 result0, result1;
    int i;

    filters = vis_to_float(0x30001000);

    pixels0 = values[0];
    pixels1 = values[1];
    for (i = 0; i < times; ++i) {
        /* Multiply pixels0 by 0.75, pixels1 by 0.25, add. */
        filt00 = vis_fmul8x16au(pixels0, filters);
        filt01 = vis_fmul8x16al(pixels1, filters);

        /* Multiply pixels0 by 0.25, pixels1 by 0.75, add. */
        filt10 = vis_fmul8x16al(pixels0, filters);
        filt11 = vis_fmul8x16au(pixels1, filters);

        result0 = vis_fpack16(vis_fpadd16(filt00, filt01));
        result1 = vis_fpack16(vis_fpadd16(filt10, filt11));

        outputs[i] = vis_freg_pair(result0, result1);

        /* Shift input window to the right. */
        pixels0 = pixels1;
        pixels1 = values[i + 2];
    }
}


5.2.4 Finding Maximum and Minimum Pixel Values 


void
minimax(vis_d64 inputs[], int times, vis_u8 *min, vis_u8 *max)
{
    int i;
    int mask;
    vis_d64 my_min, my_max, in_hi, in_lo, in;
    vis_f32 zeros;
    vis_u8 min0, min1, min2, min3, max0, max1, max2, max3;

    zeros = vis_fzeros();
    my_min = vis_fpmerge(zeros, vis_read_hi(inputs[0]));
    my_max = my_min;
    for (i = 0; i < times; ++i) {
        in = inputs[i];
        /* Expand each four bytes into four shorts */
        in_hi = vis_fpmerge(zeros, vis_read_hi(in));
        in_lo = vis_fpmerge(zeros, vis_read_lo(in));

        /* If an entry of the input > my_max,
           overwrite my_max with the input. */
        mask = vis_fcmpgt16(in_hi, my_max);
        vis_pst_16(in_hi, &my_max, mask);
        mask = vis_fcmpgt16(in_lo, my_max);
        vis_pst_16(in_lo, &my_max, mask);

        /* If an entry of my_min > the input,
           overwrite my_min with the input. */
        mask = vis_fcmpgt16(my_min, in_hi);
        vis_pst_16(in_hi, &my_min, mask);
        mask = vis_fcmpgt16(my_min, in_lo);
        vis_pst_16(in_lo, &my_min, mask);
    }

    /* vis_fpmerge(zeros, x) places each pixel in the low byte of a
       16-bit field, so the minimums are in bytes 1, 3, 5, 7 of my_min. */
    min0 = *((vis_u8 *) &my_min + 1);
    min1 = *((vis_u8 *) &my_min + 3);
    min2 = *((vis_u8 *) &my_min + 5);
    min3 = *((vis_u8 *) &my_min + 7);

    /* Maximums are in bytes 1, 3, 5, 7 of my_max. */
    max0 = *((vis_u8 *) &my_max + 1);
    max1 = *((vis_u8 *) &my_max + 3);
    max2 = *((vis_u8 *) &my_max + 5);
    max3 = *((vis_u8 *) &my_max + 7);

#define MIN(a,b) ((a)<(b)?(a):(b))
#define MAX(a,b) ((a)>(b)?(a):(b))
    *min = MIN(MIN(min0, min1), MIN(min2, min3));
    *max = MAX(MAX(max0, max1), MAX(max2, max3));
}


5.2.5 Byte Merging


Byte merging may be used to interleave multi-banded images. The following
example combines separate red, green, blue, and alpha images into a single
four-banded image with pixels in (alpha, blue, green, red) format.


vis_d64 *red, *green, *blue, *alpha, *abgr;
vis_d64 r, g, b, a, ag, br;
int i, times;

for (i = 0; i < times; ++i) {
    r = red[i];      /* r0r1r2r3r4r5r6r7 */
    g = green[i];    /* g0g1g2g3g4g5g6g7 */
    b = blue[i];     /* b0b1b2b3b4b5b6b7 */
    a = alpha[i];    /* a0a1a2a3a4a5a6a7 */

    ag = vis_fpmerge(vis_read_hi(a), vis_read_hi(g));
                     /* a0g0a1g1a2g2a3g3 */
    br = vis_fpmerge(vis_read_hi(b), vis_read_hi(r));
                     /* b0r0b1r1b2r2b3r3 */

    /* Merge to obtain a0b0g0r0a1b1g1r1. */
    abgr[4*i] = vis_fpmerge(vis_read_hi(ag), vis_read_hi(br));
    /* Merge to obtain a2b2g2r2a3b3g3r3. */
    abgr[4*i + 1] = vis_fpmerge(vis_read_lo(ag), vis_read_lo(br));

    ag = vis_fpmerge(vis_read_lo(a), vis_read_lo(g));
                     /* a4g4a5g5a6g6a7g7 */
    br = vis_fpmerge(vis_read_lo(b), vis_read_lo(r));
                     /* b4r4b5r5b6r6b7r7 */

    /* Merge to obtain a4b4g4r4a5b5g5r5. */
    abgr[4*i + 2] = vis_fpmerge(vis_read_hi(ag), vis_read_hi(br));
    /* Merge to obtain a6b6g6r6a7b7g7r7. */
    abgr[4*i + 3] = vis_fpmerge(vis_read_lo(ag), vis_read_lo(br));
}
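
The two-stage merge can be verified with a portable model of the interleave. The sketch below is an illustration, not the VIS intrinsic: fpmerge_model, hi32, and lo32 are hypothetical names, and the model assumes that the bytes of the first operand land in the more significant position of each output pair, as the comments in the example indicate.

```c
#include <stdint.h>

/* Hypothetical model of vis_fpmerge(): interleave the four bytes of a with
 * the four bytes of b, giving a0 b0 a1 b1 a2 b2 a3 b3. */
static uint64_t fpmerge_model(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t byte_a = (uint8_t)(a >> (24 - 8 * i));
        uint8_t byte_b = (uint8_t)(b >> (24 - 8 * i));
        r = (r << 16) | ((uint64_t)byte_a << 8) | byte_b;
    }
    return r;
}

/* Models of vis_read_hi() and vis_read_lo() on a 64-bit value. */
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t lo32(uint64_t v) { return (uint32_t)v; }
```

Applying the merge once pairs alpha with green and blue with red; applying it again to those intermediate results yields complete four-band pixels.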


5.2.6 Transposing a Block of Bytes


The following example shows how to transpose a block of bytes. An 8x8 matrix
[p] is transposed into an 8x8 matrix [q].


[ q00 q01 ... q07 ]   [ p00 p10 ... p70 ]
[ q10 q11 ... q17 ] = [ p01 p11 ... p71 ]
[ ...             ]   [ ...             ]
[ q70 q71 ... q77 ]   [ p07 p17 ... p77 ]


vis_d64 p0, p1, p2, p3, p4, p5, p6, p7; /* Inputs. */
vis_d64 q0, q1, q2, q3, q4, q5, q6, q7; /* Outputs. */
vis_d64 m04, m15, m26, m37, m0426, m1537; /* Temporaries. */

m04 = vis_fpmerge(vis_read_hi(p0), vis_read_hi(p4));
m15 = vis_fpmerge(vis_read_hi(p1), vis_read_hi(p5));
m26 = vis_fpmerge(vis_read_hi(p2), vis_read_hi(p6));
m37 = vis_fpmerge(vis_read_hi(p3), vis_read_hi(p7));
m0426 = vis_fpmerge(vis_read_hi(m04), vis_read_hi(m26));
m1537 = vis_fpmerge(vis_read_hi(m15), vis_read_hi(m37));
q0 = vis_fpmerge(vis_read_hi(m0426), vis_read_hi(m1537));
q1 = vis_fpmerge(vis_read_lo(m0426), vis_read_lo(m1537));
m0426 = vis_fpmerge(vis_read_lo(m04), vis_read_lo(m26));
m1537 = vis_fpmerge(vis_read_lo(m15), vis_read_lo(m37));
q2 = vis_fpmerge(vis_read_hi(m0426), vis_read_hi(m1537));
q3 = vis_fpmerge(vis_read_lo(m0426), vis_read_lo(m1537));

m04 = vis_fpmerge(vis_read_lo(p0), vis_read_lo(p4));
m26 = vis_fpmerge(vis_read_lo(p2), vis_read_lo(p6));
m15 = vis_fpmerge(vis_read_lo(p1), vis_read_lo(p5));
m37 = vis_fpmerge(vis_read_lo(p3), vis_read_lo(p7));
m0426 = vis_fpmerge(vis_read_hi(m04), vis_read_hi(m26));
m1537 = vis_fpmerge(vis_read_hi(m15), vis_read_hi(m37));
q4 = vis_fpmerge(vis_read_hi(m0426), vis_read_hi(m1537));
q5 = vis_fpmerge(vis_read_lo(m0426), vis_read_lo(m1537));
m0426 = vis_fpmerge(vis_read_lo(m04), vis_read_lo(m26));
m1537 = vis_fpmerge(vis_read_lo(m15), vis_read_lo(m37));
q6 = vis_fpmerge(vis_read_hi(m0426), vis_read_hi(m1537));
q7 = vis_fpmerge(vis_read_lo(m0426), vis_read_lo(m1537));
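
That three rounds of merging produce a transpose can be confirmed with a portable model. The sketch below follows the same merge sequence as the example, folded into loops; it is an illustration rather than VIS code, and fpmerge_model, half_of, and transpose8x8_model are hypothetical names. Each 64-bit value holds one matrix row, with element 0 in the most significant byte.

```c
#include <stdint.h>

/* Hypothetical model of vis_fpmerge(): a0 b0 a1 b1 a2 b2 a3 b3. */
static uint64_t fpmerge_model(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t byte_a = (uint8_t)(a >> (24 - 8 * i));
        uint8_t byte_b = (uint8_t)(b >> (24 - 8 * i));
        r = (r << 16) | ((uint64_t)byte_a << 8) | byte_b;
    }
    return r;
}

/* lo == 0 models vis_read_hi(), lo == 1 models vis_read_lo(). */
static uint32_t half_of(uint64_t v, int lo)
{
    return lo ? (uint32_t)v : (uint32_t)(v >> 32);
}

/* Transpose the 8x8 byte matrix p into q using only merges. */
static void transpose8x8_model(const uint64_t p[8], uint64_t q[8])
{
    for (int h = 0; h < 2; ++h) {   /* h = 0: columns 0-3, h = 1: 4-7 */
        uint64_t m04 = fpmerge_model(half_of(p[0], h), half_of(p[4], h));
        uint64_t m15 = fpmerge_model(half_of(p[1], h), half_of(p[5], h));
        uint64_t m26 = fpmerge_model(half_of(p[2], h), half_of(p[6], h));
        uint64_t m37 = fpmerge_model(half_of(p[3], h), half_of(p[7], h));
        for (int k = 0; k < 2; ++k) {
            uint64_t m0426 = fpmerge_model(half_of(m04, k), half_of(m26, k));
            uint64_t m1537 = fpmerge_model(half_of(m15, k), half_of(m37, k));
            q[4 * h + 2 * k] =
                fpmerge_model(half_of(m0426, 0), half_of(m1537, 0));
            q[4 * h + 2 * k + 1] =
                fpmerge_model(half_of(m0426, 1), half_of(m1537, 1));
        }
    }
}
```

Each merge round doubles the number of source rows represented in a register (1, 2, 4, then 8), which is why 24 merges suffice for the full 8x8 block.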


5.2.7 Using VIS Instructions in SPARC Assembly 





! FUNCTION
!     vis_inverse_8_asm - invert an image into another
!
! SYNOPSIS
!     void vis_inverse_8_asm (vis_u8  *src,
!                             vis_u8  *dst,
!                             vis_u32 size);
!
! ARGUMENT
!     src   source image
!     dst   destination image
!     size  image size
!
! NOTES
!     src and dst must point to 8-byte aligned addresses
!     size=XSIZE*YSIZE*ZSIZE must be multiple of 8
!
! DESCRIPTION
!     dst = 255 - src





! Minimum size of stack frame according to SPARC ABI
#define MINFRAME 96

! ENTRY provides the standard procedure entry code
#define ENTRY(x) \
    .align 4; \
    .global x; \
x:

! SET_SIZE trails a function and sets the size for the ELF symbol
! table
#define SET_SIZE(x) \
    .size x, (.-x)

! SPARC has four integer register groups. i-registers %i0 to %i7
! hold input data. o-registers %o0 to %o7 hold output data.
! l-registers %l0 to %l7 hold local data. g-registers %g0 to %g7
! hold global data. Note that %g0 is always zero; writing to it has
! no program-visible effect.

! When calling an assembly function, the first 6 arguments are
! stored in i-registers from %i0 to %i5. The remaining arguments are
! stored on the stack. Note that %i6 is reserved for the frame pointer
! and %i7 for the return address.


#define src %i0
#define dst %i1
#define sz  %i2

!frame pointer %i6
!return addr   %i7
!stack pointer %o6
!call link     %o7

#define sa    %l0
#define da    %l1
#define lpcnt %l2

#define sd %f0
#define dd %f2
    .section ".text",#alloc,#execinstr
    ENTRY(vis_inverse_8_asm)    ! function name
    save %sp,-MINFRAME,%sp      ! reserve space for stack
                                ! and adjust register window
! do some error checking
    tst sz                      ! size > 0
    ble,pn %icc,ret

! calculate loop count
    sra sz,3,lpcnt              ! 8 byte per loop
    mov src,sa
    mov dst,da
    sub da,8,da
    ldd [sa],sd
loop:
    add da,8,da
    add sa,8,sa
    fnot1 sd,dd
    deccc lpcnt
    std dd,[da]
    bg,pt %icc,loop
    ldd [sa],sd                 ! delay instruction after this branch
                                ! always gets executed.
                                ! see p.145 in V9 Manual
ret:
    ret                         ! return
    restore                     ! restore register window

    SET_SIZE(vis_inverse_8_asm)


5.2.8 Using VIS Block Load and Store Instructions 








! FUNCTION
!     vis_inverse_8_blk - invert an image into another
!
! SYNOPSIS
!     void vis_inverse_8_blk (vis_u8  *src,
!                             vis_u8  *dst,
!                             vis_u32 size);
!
! ARGUMENT
!     src   source image
!     dst   destination image
!     size  image size
!
! NOTES
!     src and dst must point to 64-byte aligned addresses
!     size=XSIZE*YSIZE*ZSIZE must be multiple of 64
!
! DESCRIPTION
!     dst = 255 - src





#include "vis_asi.h"

! Minimum size of stack frame according to SPARC ABI
#define MINFRAME 96

! ENTRY provides the standard procedure entry code
#define ENTRY(x) \
    .align 4; \
    .global x; \
x:

! SET_SIZE trails a function and sets the size for the ELF symbol
! table
#define SET_SIZE(x) \
    .size x, (.-x)

#define USE_BLD
#define USE_BST

#define MEMBAR_BEFORE_BLD membar #StoreLoad
#define MEMBAR_AFTER_BLD  membar #StoreLoad

#define BI       fmovd XX,XX
#define BUBBLE   BI
#define BUBBLE1  BI
#define BUBBLE2  BI; BI
#define BUBBLE3  BI; BI; BI
#define BUBBLE4  BI; BI; BI; BI
#define BUBBLE5  BI; BI; BI; BI; BI
#define BUBBLE6  BI; BI; BI; BI; BI; BI
#define BUBBLE7  BI; BI; BI; BI; BI; BI; BI
#define BUBBLE8  BI; BI; BI; BI; BI; BI; BI; BI
#define BUBBLE9  BI; BI; BI; BI; BI; BI; BI; BI; BI
#define BUBBLE10 BI; BI; BI; BI; BI; BI; BI; BI; BI; BI

#ifdef USE_BLD
#define BLD_A0 \
    ldda [sa]ASI_BLK_P,A0; \
    cmp sa,se; \
    blu %icc,1f; \
    inc 64,sa; \
    dec 64,sa; \
1:
#else
#define BLD_A0 \
    ldd [sa + 0],A0; \
    ldd [sa + 8],A1; \
    ldd [sa + 16],A2; \
    ldd [sa + 24],A3; \
    ldd [sa + 32],A4; \
    ldd [sa + 40],A5; \
    ldd [sa + 48],A6; \
    ldd [sa + 56],A7; \
    cmp sa,se; \
    blu %icc,1f; \
    inc 64,sa; \
    dec 64,sa; \
1:
#endif

#ifdef USE_BLD
#define BLD_B0 \
    ldda [sa]ASI_BLK_P,B0; \
    cmp sa,se; \
    blu %icc,1f; \
    inc 64,sa; \
    dec 64,sa; \
1:
#else
#define BLD_B0 \
    ldd [sa + 0],B0; \
    ldd [sa + 8],B1; \
    ldd [sa + 16],B2; \
    ldd [sa + 24],B3; \
    ldd [sa + 32],B4; \
    ldd [sa + 40],B5; \
    ldd [sa + 48],B6; \
    ldd [sa + 56],B7; \
    cmp sa,se; \
    blu %icc,1f; \
    inc 64,sa; \
    dec 64,sa; \
1:
#endif

#ifdef USE_BST
#define BST \
    stda O0,[da]ASI_BLK_P; \
    inc 64,da; \
    deccc ns; \
    ble,pn %icc,loop_end; \
    nop
#else
#define BST \
    std O0,[da + 0]; \
    std O1,[da + 8]; \
    std O2,[da + 16]; \
    std O3,[da + 24]; \
    std O4,[da + 32]; \
    std O5,[da + 40]; \
    std O6,[da + 48]; \
    std O7,[da + 56]; \
    inc 64,da; \
    deccc ns; \
    ble,pn %icc,loop_end; \
    nop
#endif

#define INVERSE0 \
    fnot1 A0,O0; \
    fnot1 A1,O1; \
    fnot1 A2,O2; \
    fnot1 A3,O3; \
    fnot1 A4,O4; \
    fnot1 A5,O5; \
    fnot1 A6,O6; \
    fnot1 A7,O7;

#define INVERSE1 \
    fnot1 B0,O0; \
    fnot1 B1,O1; \
    fnot1 B2,O2; \
    fnot1 B3,O3; \
    fnot1 B4,O4; \
    fnot1 B5,O5; \
    fnot1 B6,O6; \
    fnot1 B7,O7;

! SPARC has four integer register groups. i-registers %i0 to %i7
! hold input data. o-registers %o0 to %o7 hold output data.
! l-registers %l0 to %l7 hold local data. g-registers %g0 to %g7
! hold global data. Note that %g0 is always zero; writing to it has
! no program-visible effect.

! When calling an assembly function, the first 6 arguments are
! stored in i-registers from %i0 to %i5. The remaining arguments are
! stored on the stack. Note that %i6 is reserved for the frame pointer
! and %i7 for the return address.

! Only the first 32 f-registers can be used as 32-bit registers.
! The last 32 f-registers can only be used as 16 64-bit registers.

#define src %i0
#define dst %i1
#define sz  %i2

!frame pointer %i6
!return addr   %i7
!stack pointer %o6
!call link     %o7

#define sa
#define da
#define se
#define ns
#define XX

#define O00 %f16
#define O01 %f17
#define O10 %f18
#define O11 %f19
#define O20 %f20
#define O21 %f21
#define O30 %f22
#define O31 %f23
#define O40 %f24
#define O41 %f25
#define O50 %f26
#define O51 %f27
#define O60 %f28
#define O61 %f29
#define O70 %f30
#define O71 %f31

#define O0 %f16
#define O1 %f18
#define O2 %f20
#define O3 %f22
#define O4 %f24
#define O5 %f26
#define O6 %f28
#define O7 %f30

#define A0 %f32
#define A1 %f34
#define A2 %f36
#define A3 %f38
#define A4 %f40
#define A5 %f42
#define A6 %f44
#define A7 %f46

#define B0 %f48
#define B1 %f50
#define B2 %f52
#define B3 %f54
#define B4 %f56
#define B5 %f58
#define B6 %f60
#define B7 %f62

    .section ".text",#alloc,#execinstr
    ENTRY(vis_inverse_8_blk)    ! function name
    save %sp,-MINFRAME,%sp      ! reserve space for stack
                                ! and adjust register window
! do some error checking
    tst sz                      ! size > 0
    ble,pn %icc,ret

! calculate loop count
    sra sz,6,ns                 ! 64 bytes per loop
    add src,sz,se               ! end address of source
    mov src,sa
    mov dst,da

    MEMBAR_BEFORE_BLD           ! issue memory barrier instruction
                                ! to ensure all previous memory load
                                ! and store has completed
    BLD_A0
    BLD_B0                      ! issue the 2nd block load instruction
                                ! to synchronize with returning data
loop_bgn:
    INVERSE0                    ! process data returned by BLD_A0
    BLD_A0                      ! block load and sync data from BLD_B0
    BST                         ! block store data from BLD_A0

    INVERSE1                    ! process data returned by BLD_B0
    BLD_B0                      ! block load and sync data from BLD_A0
    BST                         ! block store data from BLD_B0

    bg,pt %icc,loop_bgn

loop_end:
    MEMBAR_AFTER_BLD            ! issue memory barrier instruction
                                ! to ensure all previous memory load
                                ! and store has completed.
ret:
    ret                         ! return
    restore                     ! restore register window

    SET_SIZE(vis_inverse_8_blk)

5.2.9 Using array8 With Assembly Code 


The following example shows the use of the array8 instruction from assembly 
code to process eight pixels in nine clocks, assuming the data are all in L2-cache 


(eight-cycle latency): 


#define blocked0 10 
#define blocked0 11 
#define base 2 
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#define seven 13 
#define size 14 
#define fixed0 00 
#define fixedl ol 
#define step 2 
#define step7 03 
#define stepl5 4 


alignaddr %g0, %seven, %g0 ; init %gsr to 7
; init %loop_counter to -numpixels/16
; (assume numpixels divisible by 16)
; place initial fixed-point address into fixed0
; place step into %step, 7*step into %step7, 15*step into %step15

; prior to the loop, generate %f8-%f15
addx %fixed0, %step7, %fixed0 ; fixed0 = address of point #7
array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #7
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #6
array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #6
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f16 ; load point #7
subx %fixed1, %step, %fixed0 ; backtrack to point #5

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #5
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f18 ; load point #6
subx %fixed0, %step, %fixed1 ; backtrack to point #4

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #4
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f20 ; load point #5
subx %fixed1, %step, %fixed0 ; backtrack to point #3

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #3
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f22 ; load point #4
subx %fixed0, %step, %fixed1 ; backtrack to point #2

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #2
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f24 ; load point #3
subx %fixed1, %step, %fixed0 ; backtrack to point #1

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #1
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f26 ; load point #2
subx %fixed0, %step, %fixed1 ; backtrack to point #0

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #0
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f28 ; load point #1
addx %fixed1, %step15, %fixed0 ; fixed0 = address of point #15

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #15
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f30 ; load point #0
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #14
loop:
array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #14
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f0 ; load point #15
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #13
faligndata %f16, %accum1, %accum1

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #13
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f2 ; load point #14
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #12
faligndata %f18, %accum1, %accum1

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #12
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f4 ; load point #13
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #11
faligndata %f20, %accum1, %accum1

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #11
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f6 ; load point #12
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #10
faligndata %f22, %accum1, %accum1

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #10
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f8 ; load point #11
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #9
faligndata %f24, %accum1, %accum1

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #9
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f10 ; load point #10
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #8
faligndata %f26, %accum1, %accum1

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #8
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f12 ; load point #9
addx %fixed1, %step15, %fixed0 ; fixed0 = address of point #23
faligndata %f28, %accum1, %accum1

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #23
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f14 ; load point #8
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #22
faligndata %f30, %accum1, %accum1

std [%output], %accum1 ; store pixels 0-7
addcc %loop_counter, 1, %loop_counter
add %output, 8, %output

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #22
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f16 ; load point #23
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #21
faligndata %f0, %accum0, %accum0

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #21
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f18 ; load point #22
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #20
faligndata %f2, %accum0, %accum0

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #20
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f20 ; load point #21
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #19
faligndata %f4, %accum0, %accum0

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #19
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f22 ; load point #20
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #18
faligndata %f6, %accum0, %accum0

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #18
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f24 ; load point #19
subx %fixed1, %step, %fixed0 ; fixed0 = address of point #17
faligndata %f8, %accum0, %accum0

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #17
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f26 ; load point #18
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #16
faligndata %f10, %accum0, %accum0

array8 %fixed1, %size, %blocked1 ; blocked1 = address of point #16
ldda [%base + %blocked0] ASI_FL8_PRIMARY, %f28 ; load point #17
addx %fixed1, %step15, %fixed0 ; fixed0 = address of point #31
faligndata %f12, %accum0, %accum0

array8 %fixed0, %size, %blocked0 ; blocked0 = address of point #31
ldda [%base + %blocked1] ASI_FL8_PRIMARY, %f30 ; load point #16
subx %fixed0, %step, %fixed1 ; fixed1 = address of point #30
faligndata %f14, %accum0, %accum0

std [%output], %accum0 ; store pixels 8-15
brne loop
add %output, 8, %output

exit:
faligndata %f16, %accum1, %accum1
faligndata %f18, %accum1, %accum1
faligndata %f20, %accum1, %accum1
faligndata %f22, %accum1, %accum1
faligndata %f24, %accum1, %accum1
faligndata %f26, %accum1, %accum1
faligndata %f28, %accum1, %accum1
faligndata %f30, %accum1, %accum1
std [%output], %accum1 ; store pixels 16-23


5.3 Imaging Applications


5.3.1 Resampling of Aligned Data With a Filter Width of Four 


This example describes the resampling of a pixel array by a filter requiring four
pixel values. The VIS instructions provide a speedup through partitioned
arithmetic, which permits the simultaneous computation of eight filter output
values. Figure 5-1 shows four columns of input data, each with eight data
elements, from which eight output values are simultaneously computed. The
figure assumes a 2D layout of the input data, which need not be the case.


[Figure: four input columns p, p+1, p+2, and p+3, each holding eight data elements]

Figure 5-1 Simultaneous Computation of Eight Filter Output Values
Input data ibuf[i], stored in transposed form, contain the pixels from column i of
eight consecutive rows. obuf[j] is computed as a weighted sum of the four columns:

f0*ibuf[iTable[j]] + ... + f3*ibuf[iTable[j]+3]

The input and output data in ibuf and obuf are assumed to be aligned on 64-bit
boundaries, so vis_faligndata, vis_alignaddr and vis_edge8 are not required.
The filter coefficients are taken from coeffs_01[] and coeffs_23[]. They are
stored as signed, fixed-point numbers with 14 fractional bits (meaning they are
roughly between -1.9999 and 1.9999). By choosing the filters according to the
subpixel positions within the source data, this routine may be used to implement
one pass of a two-pass bicubic filtering algorithm.
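As a cross-check on the arithmetic described above, the weighted sum can be written as a plain scalar loop. The sketch below is an illustration only: the name resample_ref, the two-dimensional array types, and the single coeffs[j][k] table (in place of the packed coeffs_01[] and coeffs_23[] pairs) are assumptions made for readability, and the rounding is not claimed to be bit-exact with the VIS version.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of an 8-bit pixel. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t) v;
}

/*
 * Scalar reference for the four-tap resampling filter (hypothetical helper).
 * ibuf[c][r]   pixel of column c in row r (transposed layout, eight rows)
 * obuf[j][r]   output j for row r
 * iTable[j]    first source column used for output j
 * coeffs[j][k] tap k for output j, signed fixed point with 14 fractional bits
 */
void resample_ref(uint8_t ibuf[][8], uint8_t obuf[][8],
                  const int iTable[], int16_t coeffs[][4], int dwidth)
{
    for (int j = 0; j < dwidth; ++j) {
        for (int row = 0; row < 8; ++row) {
            int32_t acc = 0;
            for (int k = 0; k < 4; ++k)
                acc += (int32_t) coeffs[j][k] * ibuf[iTable[j] + k][row];
            acc += 1 << 13;  /* round to nearest before dropping 14 bits */
            obuf[j][row] = (acc < 0) ? 0 : clamp_u8(acc >> 14);
        }
    }
}
```

A coefficient value of 16384 represents 1.0 in this format, so a tap set of {0, 16384, 0, 0} simply copies column iTable[j]+1 to the output.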


#include "vis_types.h"
#include "vis_proto.h"

void
resample(vis_d64 *ibuf,        /* Input buffer. */
         vis_d64 *obuf,        /* Output buffer. */
         int iTable[],         /* Source column numbers. */
         vis_f32 coeffs_01[],  /* First two filter coefficients. */
         vis_f32 coeffs_23[],  /* Second two filter coefficients. */
         int dwidth)           /* Number of outputs to produce. */
{
    int p;
    vis_f32 f01, f23;
    vis_d64 pix0, pix1, pix2, pix3, acc_hi, acc_lo;

    vis_write_gsr(1 << 3);

    for (p = 0; p < dwidth; ++p) {
        /* Cache filter coefficients. */
        f01 = coeffs_01[p];
        f23 = coeffs_23[p];

        /* Read pixel data. */
        pix0 = ibuf[iTable[p]];
        pix1 = ibuf[iTable[p] + 1];
        pix2 = ibuf[iTable[p] + 2];
        pix3 = ibuf[iTable[p] + 3];

        /* Compute high and low words of f0*pix0 + f1*pix1. */
        acc_hi = vis_fpadd16(vis_fmul8x16au(vis_read_hi(pix0), f01),
                             vis_fmul8x16al(vis_read_hi(pix1), f01));
        acc_lo = vis_fpadd16(vis_fmul8x16au(vis_read_lo(pix0), f01),
                             vis_fmul8x16al(vis_read_lo(pix1), f01));

        /* Add high and low words of f2*pix2 to accumulator. */
        acc_hi = vis_fpadd16(acc_hi,
                             vis_fmul8x16au(vis_read_hi(pix2), f23));
        acc_lo = vis_fpadd16(acc_lo,
                             vis_fmul8x16au(vis_read_lo(pix2), f23));

        /* Add high and low words of f3*pix3 to accumulator. */
        acc_hi = vis_fpadd16(acc_hi,
                             vis_fmul8x16al(vis_read_hi(pix3), f23));
        acc_lo = vis_fpadd16(acc_lo,
                             vis_fmul8x16al(vis_read_lo(pix3), f23));

        /* Pack, join halves, and store result into obuf. */
        obuf[p] = vis_freg_pair(vis_fpack16(acc_hi),
                                vis_fpack16(acc_lo));
    }
}



5.3.2 Handling Three Band Data


This example shows how to handle three-band pixel data. The value of each pixel
in each band is compared to a threshold thresh for that band. If the pixel band
value is above the threshold, the destination is set to the above value for that
band; otherwise it is set to the below value for that band. Each pixel is
represented by three values: B, G, and R. Since VIS processes data as 8-byte
partitioned 64-bit words, a word cannot hold a whole number of complete
three-byte pixels. To overcome this, pixels are arranged for processing in three
8-byte segments that are defined depending on the destination address offset.
If the destination address offset is 0, the three processing segments are
defined as follows:

Segment 1: B0 G0 R0 B1 G1 R1 B2 G2
Segment 2: R2 B3 G3 R3 B4 G4 R4 B5
Segment 3: G5 R5 B6 G6 R6 B7 G7 R7

If the destination address offset is not zero, the processing byte segment
arrangement is circularly shifted by the offset value. For example, a
destination address offset of two would result in the following processing
segments:

Segment 1: G7 R7 B0 G0 R0 B1 G1 R1
Segment 2: B2 G2 R2 B3 G3 R3 B4 G4
Segment 3: R4 B5 G5 R5 B6 G6 R6 B7

Any remaining run of fewer than eight pixels is processed with three
if-conditionals.
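The circular shift of the segment layout can be reproduced with a few lines of portable C. The helper below is a hypothetical illustration (segment_layout is not part of the VIS library); it assumes an offset between 0 and 7 and simply labels each of the 24 bytes with its band and pixel number.

```c
/*
 * Write the band/pixel label of every byte in the three 8-byte
 * processing segments for a given destination address offset.
 * Each out[s] receives a string such as "B0 G0 R0 B1 G1 R1 B2 G2".
 * (Hypothetical helper for illustration; assumes 0 <= offset <= 7.)
 */
void segment_layout(int offset, char out[3][24])
{
    static const char band_name[3] = { 'B', 'G', 'R' };

    for (int k = 0; k < 24; ++k) {
        /* Position of byte k in the unshifted B0 G0 R0 ... stream,
           wrapped circularly into 0..23. */
        int pos = ((k - offset) % 24 + 24) % 24;
        char *p = &out[k / 8][(k % 8) * 3];

        p[0] = band_name[pos % 3];            /* band of this byte  */
        p[1] = (char) ('0' + pos / 3);        /* pixel number 0..7  */
        p[2] = (k % 8 == 7) ? '\0' : ' ';     /* separator or end   */
    }
}
```

The band of the first byte of segment 1 is (-offset) mod 3, which appears to be the same quantity the listing below obtains with index expressions of the form (9 + off) % 3, where off is the negative of the destination offset.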





/*
 * ARGUMENTS
 * src     pointer to first byte of first pixel of source data
 * dst     pointer to first byte of first pixel of destination
 * length  length of the data in pixels
 * thresh  pointer to array of thresholds
 * above   pointer to array of values for pixels above thresholds
 * below   pointer to array of values for pixels below thresholds
 */

#include "vis_types.h"
#include "vis_proto.h"

#define THRESHOLD(tdh, tdl, ad, bd)            \
    s0 = sp[0];                                \
    s1 = sp[1];                                \
    sd = vis_faligndata(s0, s1);               \
    sdh = vis_fexpand_hi(sd);                  \
    sdl = vis_fexpand_lo(sd);                  \
    cmaskh = vis_fcmple16(tdh, sdh);           \
    cmaskl = vis_fcmple16(tdl, sdl);           \
    cmask = (cmaskh << 4) | cmaskl;            \
    vis_pst_8(ad, dp, emask & ~cmask);         \
    vis_pst_8(bd, dp, emask & cmask);          \
    sp++;                                      \
    dp++;                                      \
    emask = vis_edge8(dp, dend)



void vis_thresh83(vis_u8 *src, vis_u8 *dst, int length,
                  vis_s16 *thresh, vis_s16 *above,
                  vis_s16 *below)
{
    vis_u8 *sa = src;            /* start point of a line in source */
    vis_d64 *sp;                 /* 8-byte aligned start point in source */
    vis_u8 *da = dst;            /* start of a line in destination */
    vis_u8 *dend;                /* end point of a line in destination */
    vis_d64 *dp;                 /* 8-byte aligned destination start point */
    int off;                     /* address alignment offset in destination */
    int emask;                   /* edge mask */
    vis_d64 sd, s1, s0, sdh, sdl; /* source data */
    vis_d64 t0, t1, t2;          /* threshold */
    vis_f32 tf;
    vis_u32 tu;
    vis_d64 a0, a1, a2;          /* above value */
    vis_u32 auh, aul;
    vis_d64 b0, b1, b2;          /* below value */
    vis_u32 buh, bul;
    int cmask, cmaskh, cmaskl;   /* comparison masks */
    int i, num;                  /* loop variables */

    /* Prepare the destination address */
    dp = (vis_d64 *) ((vis_u32) da & (~7));
    off = (vis_u32) dp - (vis_u32) da;
    dend = da + 3 * length - 1;

    /* Prepare the source address */
    sp = (vis_d64 *) vis_alignaddr(sa, off);

    /* Prepare the thresholds */
    tu = (thresh[( 9 + off) % 3] << 24)
       | (thresh[(10 + off) % 3] << 16)
       | (thresh[(11 + off) % 3] << 8)
       |  thresh[( 9 + off) % 3];
    tf = vis_to_float(tu);
    t0 = vis_fexpand(tf);
    tu = (thresh[(10 + off) % 3] << 24)
       | (thresh[(11 + off) % 3] << 16)
       | (thresh[( 9 + off) % 3] << 8)
       |  thresh[(10 + off) % 3];
    tf = vis_to_float(tu);
    t1 = vis_fexpand(tf);
    tu = (thresh[(11 + off) % 3] << 24)
       | (thresh[( 9 + off) % 3] << 16)
       | (thresh[(10 + off) % 3] << 8)
       |  thresh[(11 + off) % 3];
    tf = vis_to_float(tu);
    t2 = vis_fexpand(tf);

    /* Prepare the above values */
    auh = (above[( 9 + off) % 3] << 24)
        | (above[(10 + off) % 3] << 16)
        | (above[(11 + off) % 3] << 8)
        |  above[( 9 + off) % 3];
    aul = (above[(10 + off) % 3] << 24)
        | (above[(11 + off) % 3] << 16)
        | (above[( 9 + off) % 3] << 8)
        |  above[(10 + off) % 3];
    a0 = vis_to_double(auh, aul);
    auh = (above[(11 + off) % 3] << 24)
        | (above[( 9 + off) % 3] << 16)
        | (above[(10 + off) % 3] << 8)
        |  above[(11 + off) % 3];
    aul = (above[( 9 + off) % 3] << 24)
        | (above[(10 + off) % 3] << 16)
        | (above[(11 + off) % 3] << 8)
        |  above[( 9 + off) % 3];
    a1 = vis_to_double(auh, aul);
    auh = (above[(10 + off) % 3] << 24)
        | (above[(11 + off) % 3] << 16)
        | (above[( 9 + off) % 3] << 8)
        |  above[(10 + off) % 3];
    aul = (above[(11 + off) % 3] << 24)
        | (above[( 9 + off) % 3] << 16)
        | (above[(10 + off) % 3] << 8)
        |  above[(11 + off) % 3];
    a2 = vis_to_double(auh, aul);

    /* Prepare the below values */
    buh = (below[( 9 + off) % 3] << 24)
        | (below[(10 + off) % 3] << 16)
        | (below[(11 + off) % 3] << 8)
        |  below[( 9 + off) % 3];
    bul = (below[(10 + off) % 3] << 24)
        | (below[(11 + off) % 3] << 16)
        | (below[( 9 + off) % 3] << 8)
        |  below[(10 + off) % 3];
    b0 = vis_to_double(buh, bul);
    buh = (below[(11 + off) % 3] << 24)
        | (below[( 9 + off) % 3] << 16)
        | (below[(10 + off) % 3] << 8)
        |  below[(11 + off) % 3];
    bul = (below[( 9 + off) % 3] << 24)
        | (below[(10 + off) % 3] << 16)
        | (below[(11 + off) % 3] << 8)
        |  below[( 9 + off) % 3];
    b1 = vis_to_double(buh, bul);
    buh = (below[(10 + off) % 3] << 24)
        | (below[(11 + off) % 3] << 16)
        | (below[( 9 + off) % 3] << 8)
        |  below[(10 + off) % 3];
    bul = (below[(11 + off) % 3] << 24)
        | (below[( 9 + off) % 3] << 16)
        | (below[(10 + off) % 3] << 8)
        |  below[(11 + off) % 3];
    b2 = vis_to_double(buh, bul);

    /* Generate edge mask for the start point */
    emask = vis_edge8(da, dend);

    /* Calculate loop count */
    num = ((vis_u32) dend - (vis_u32) dp) / 24;

    /* 8-pixel loop */
    for (i = 0; i < num; i++) {
        /* Process segment 0 */
        THRESHOLD(t0, t1, a0, b0);
        /* Process segment 1 */
        THRESHOLD(t2, t0, a1, b1);
        /* Process segment 2 */
        THRESHOLD(t1, t2, a2, b2);
    }

    /* Process segment 0 if needed */
    if ((vis_u32) dp <= (vis_u32) dend) {
        THRESHOLD(t0, t1, a0, b0);
    }

    /* Process segment 1 if needed */
    if ((vis_u32) dp <= (vis_u32) dend) {
        THRESHOLD(t2, t0, a1, b1);
    }

    /* Process segment 2 if needed */
    if ((vis_u32) dp <= (vis_u32) dend) {
        THRESHOLD(t1, t2, a2, b2);
    }
}

VIS Instruction Set User’s Manual * May, 2001


5.3.3 Fast Lookup of 8-Bit Data 


This routine exemplifies the use of multiple cases based on input alignment, as
well as a common trick for consolidating output writes, to demonstrate a
performance improvement over a standard C implementation.


The function to be performed as written for C is: 


for (i = 0; i < width; ++i) 
dst [i] = table[input[i]]; 


Using the VIS instructions, which permit eight output bytes to be written with a
single store, increases the performance considerably. Writing eight bytes at a
time, however, requires the destination to be doubleword aligned. The required
alignment is achieved by a small initial loop which processes pixels naively
until the destination becomes aligned. Unpacking the source bytes requires the
use of shifts and logical ANDs. Since the source may not be single-word aligned
as required, the source pointer is aligned dynamically, and the pattern of byte
extractions is determined by its original alignment. If the pointer was
unaligned, some readahead is needed to span the boundaries between each chunk
of four source bytes. To avoid reading beyond the end of the source, one is
subtracted from the loop trip count, and another naive, byte-by-byte loop at
the end of the routine handles any leftover pixels.


Consolidation of the output bytes is performed using vis_faligndata, with the 
GSR alignment bits set to 7. The result of: 


accum = vis_faligndata(byte, accum) 
is to push “byte” into the left end of “accum.” The eight output bytes need to be 
pushed into the accumulator in reverse order. 
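The push operation can be modeled in portable C to see why the bytes must be supplied in reverse order. The sketch below is a model under stated assumptions, not the VIS intrinsic itself: push_byte and pack8_reverse are hypothetical names, the accumulator is an ordinary 64-bit integer, and the byte operand is assumed to sit in the least significant byte of its register, as a one-byte load would leave it.

```c
#include <stdint.h>

/*
 * Model of accum = vis_faligndata(byte, accum) with the GSR alignment
 * bits set to 7: the result is the low byte of the first operand
 * followed by the upper seven bytes of the second operand.
 */
static uint64_t push_byte(uint8_t byte, uint64_t accum)
{
    return ((uint64_t) byte << 56) | (accum >> 8);
}

/*
 * Consolidate eight output bytes into one 64-bit word. Pushing
 * out[7] first and out[0] last leaves out[0] in the leftmost
 * (most significant) byte, which is the lowest address when the
 * word is stored on a big-endian machine.
 */
uint64_t pack8_reverse(const uint8_t out[8])
{
    uint64_t accum = 0;

    for (int i = 7; i >= 0; --i)
        accum = push_byte(out[i], accum);
    return accum;
}
```

Each push shifts the earlier bytes one position to the right, so after eight pushes the first byte pushed has moved to the rightmost position and the last byte pushed occupies the left end.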


/*
 * ARGUMENTS
 * src    pointer to first byte of first pixel of source data
 * dst    pointer to first byte of first pixel of destination
 * table  lookup table
 * width  number of bytes of pixel data
 */

#include "vis_types.h"
#include "vis_proto.h"

void
lookup(vis_u8 *src, vis_u8 *dst, vis_u8 table[256], int width)
{
    vis_u32 word0, word1, word2, word3;
    vis_d64 lookup, accum;
    int byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7;
    int align, doubles, next, i;

    /* Set gsr align bits to 7. */
    (void) vis_alignaddr((void *) 0, 7);

    /* Work naively until dst is aligned. */
    align = 8 - ((unsigned long) dst & 7);
    if (align > width)
        align = width;
    if (align != 8) {
        for (i = 0; i < align; ++i)
            dst[i] = table[src[i]];
        src += align;
        dst += align;
        width -= align;
    }

    /* Now work based on source offset. */
    align = ((unsigned long) src & 0x3);

    /* Zero two lsb's of src. */
    src = (vis_u8 *) ((unsigned long) src & ~0x3);
    word0 = ((vis_u32 *) src)[0];
    word1 = ((vis_u32 *) src)[1];
    word2 = ((vis_u32 *) src)[2];
    word3 = ((vis_u32 *) src)[3];
    next = 4;

    /* Last iteration done separately so as not to read past the end. */
    doubles = width/8 - 1;


    switch (align) {
    case 0:
        for (i = 0; i < doubles; ++i) {
            byte0 = (word0 >> 24); /* No need to mask with 0xff. */
            byte1 = (word0 >> 16) & 0xff;
            byte2 = (word0 >> 8) & 0xff;
            byte3 = (word0) & 0xff;
            byte4 = (word1 >> 24);
            byte5 = (word1 >> 16) & 0xff;
            byte6 = (word1 >> 8) & 0xff;
            byte7 = (word1) & 0xff;
            word0 = word2;
            word1 = word3;
            word2 = ((vis_u32 *) src)[2*i + next];
            word3 = ((vis_u32 *) src)[2*i + next + 1];
            lookup = vis_ld_u8_i((vis_ras) table, byte7);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte6);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte5);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte4);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte3);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte2);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte1);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte0);
            accum = vis_faligndata(lookup, accum);
            ((vis_d64 *) dst)[i] = accum;
        }
        break;
    case 1:
        for (i = 0; i < doubles; ++i) {
            byte0 = (word0 >> 16) & 0xff;
            byte1 = (word0 >> 8) & 0xff;
            byte2 = (word0) & 0xff;
            byte3 = (word1 >> 24);
            byte4 = (word1 >> 16) & 0xff;
            byte5 = (word1 >> 8) & 0xff;
            byte6 = (word1) & 0xff;
            byte7 = (word2 >> 24);
            word0 = word2;
            word1 = word3;
            word2 = ((vis_u32 *) src)[2*i + next];
            word3 = ((vis_u32 *) src)[2*i + next + 1];
            lookup = vis_ld_u8_i((vis_ras) table, byte7);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte6);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte5);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte4);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte3);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte2);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte1);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte0);
            accum = vis_faligndata(lookup, accum);
            ((vis_d64 *) dst)[i] = accum;
        }
        break;
    case 2:
        for (i = 0; i < doubles; ++i) {
            byte0 = (word0 >> 8) & 0xff;
            byte1 = (word0) & 0xff;
            byte2 = (word1 >> 24);
            byte3 = (word1 >> 16) & 0xff;
            byte4 = (word1 >> 8) & 0xff;
            byte5 = (word1) & 0xff;
            byte6 = (word2 >> 24);
            byte7 = (word2 >> 16) & 0xff;
            word0 = word2;
            word1 = word3;
            word2 = ((vis_u32 *) src)[2*i + next];
            word3 = ((vis_u32 *) src)[2*i + next + 1];
            lookup = vis_ld_u8_i((vis_ras) table, byte7);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte6);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte5);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte4);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte3);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte2);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte1);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte0);
            accum = vis_faligndata(lookup, accum);
            ((vis_d64 *) dst)[i] = accum;
        }
        break;
    case 3:
        for (i = 0; i < doubles; ++i) {
            byte0 = (word0) & 0xff;
            byte1 = (word1 >> 24);
            byte2 = (word1 >> 16) & 0xff;
            byte3 = (word1 >> 8) & 0xff;
            byte4 = (word1) & 0xff;
            byte5 = (word2 >> 24);
            byte6 = (word2 >> 16) & 0xff;
            byte7 = (word2 >> 8) & 0xff;
            word0 = word2;
            word1 = word3;
            word2 = ((vis_u32 *) src)[2*i + next];
            word3 = ((vis_u32 *) src)[2*i + next + 1];
            lookup = vis_ld_u8_i((vis_ras) table, byte7);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte6);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte5);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte4);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte3);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte2);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte1);
            accum = vis_faligndata(lookup, accum);
            lookup = vis_ld_u8_i((vis_ras) table, byte0);
            accum = vis_faligndata(lookup, accum);
            ((vis_d64 *) dst)[i] = accum;
        }
        break;
    }


    /* Update pointers, remaining width. */
    src += 8*doubles;
    dst += 8*doubles;
    width -= 8*doubles;

    /* Finish up any remaining pixels. */
    for (i = 0; i < width; ++i)
        dst[i] = table[src[i]];
}

5.3.4 Alpha Blending Two Images


This example shows an application where two images are blended together. For
each pair of corresponding pixels in two images “s1” and “s2,” a corresponding
pixel is read from a third control image “alpha” to compute:


dst = (alpha/256)*s1 + (1 - alpha/256)*s2
    = (s1 - s2)*(alpha/256) + s2


Note that alpha can only range between 0 and 255, so strictly speaking we should
divide it by 255, not 256. However, the division by 256 occurs for free when we
perform the vis_fmul8x16 operation, and the destination will differ from the
correct result by a maximum of one. Whether this trade-off is acceptable
depends on the application.
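The size of the error introduced by dividing by 256 can be checked exhaustively in scalar C. The sketch below is an idealized model, assuming both blends are rounded to the nearest integer; the function names are hypothetical, and the rounding performed by the VIS instructions themselves may differ from this model.

```c
/* Blend using the divide-by-256 shortcut, rounded to nearest. */
static int blend_256(int s1, int s2, int alpha)
{
    /* alpha*s1 + (256 - alpha)*s2 is never negative for 8-bit inputs. */
    return (alpha * s1 + (256 - alpha) * s2 + 128) >> 8;
}

/* Blend using the exact divide-by-255, rounded to nearest. */
static int blend_255(int s1, int s2, int alpha)
{
    return (2 * (alpha * s1 + (255 - alpha) * s2) + 255) / 510;
}

/* Largest absolute difference between the two blends over all
   8-bit values of s1, s2 and alpha. */
int max_blend_error(void)
{
    int worst = 0;

    for (int s1 = 0; s1 < 256; ++s1)
        for (int s2 = 0; s2 < 256; ++s2)
            for (int alpha = 0; alpha < 256; ++alpha) {
                int diff = blend_256(s1, s2, alpha) - blend_255(s1, s2, alpha);
                if (diff < 0)
                    diff = -diff;
                if (diff > worst)
                    worst = diff;
            }
    return worst;
}
```

Under this model the two unrounded results differ by |s1 - s2|*alpha/(255*256), which is always less than one, so the rounded results can differ by at most one.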


The following shows the processing of one scan line: 


#define VIS_OFFSET(addr) (((vis_u32) (addr)) & 7)
#define VIS_ALIGN(addr) (((vis_u32) (addr)) & ~7)

void
alpha_blend(vis_u8 *d, vis_u8 *s1, vis_u8 *s2, vis_u8 *a,
            int width)
/*
 * Arguments
 * d = pointer to destination data
 * s1 = pointer to data for image "s1"
 * s2 = pointer to data for image "s2"
 * a = pointer to data for control image alpha
 * width = data width of s1, s2 and alpha
 */
{
    /* Last byte of destination. */
    vis_u8 *d_end;

    /* Doubleword-aligned pointers. */
    vis_d64 *d_aligned, *s1_aligned, *s2_aligned, *alpha_aligned;

    /* Alignment of original pointers. */
    int d_offset, s1_offset, s2_offset, alpha_offset;

    /* Unaligned data from memory. */
    vis_d64 u_alpha_0, u_alpha_1, u_s1_0, u_s1_1, u_s2_0, u_s2_1;

    /* Properly aligned data. */
    vis_d64 quad_a, dbl_s1, dbl_s2, dbl_a, dbl_d;

    /* Temporaries. */
    vis_d64 dbl_s1_e, dbl_s2_e, dbl_tmp1, dbl_tmp2;
    vis_d64 dbl_sum1, dbl_sum2;

    /* Edge mask for partial stores. */
    unsigned int emask;

    /* Loop variables. */
    int i, times;

    vis_write_gsr(3 << 3);
    /* Four (= 7 - 3) bits of fractional precision. */

    d_end = d + width - 1;
    d_offset = VIS_OFFSET(d);
    d_aligned = (vis_d64 *) VIS_ALIGN(d);

    /* Compute initial edge mask for destination. */
    emask = vis_edge8(d, d_end);

    /* Align addresses relative to destination alignment and
       load data. */
    s1_offset = VIS_OFFSET(s1 - d_offset);
    s1_aligned = vis_alignaddr(s1, -d_offset);
    u_s1_0 = s1_aligned[0];
    u_s1_1 = s1_aligned[1];

    s2_offset = VIS_OFFSET(s2 - d_offset);
    s2_aligned = vis_alignaddr(s2, -d_offset);
    u_s2_0 = s2_aligned[0];
    u_s2_1 = s2_aligned[1];

    alpha_offset = VIS_OFFSET(a - d_offset);
    alpha_aligned = vis_alignaddr(a, -d_offset);
    u_alpha_0 = alpha_aligned[0];
    u_alpha_1 = alpha_aligned[1];

    /* Number of times through the loop. */
    times = ((vis_u32) d_end >> 3) - ((vis_u32) d_aligned >> 3) + 1;

    for (i = 0; i < times; ++i) {
        /* Set alignment for alpha. */
        (void) vis_alignaddr((void *) 0, alpha_offset);
        quad_a = vis_faligndata(u_alpha_0, u_alpha_1);
        u_alpha_0 = u_alpha_1;
        u_alpha_1 = alpha_aligned[i + 2];

        /* Set alignment for s1. */
        (void) vis_alignaddr((void *) 0, s1_offset);
        dbl_s1 = vis_faligndata(u_s1_0, u_s1_1);
        u_s1_0 = u_s1_1;
        u_s1_1 = s1_aligned[i + 2];

        /* Set alignment for s2. */
        (void) vis_alignaddr((void *) 0, s2_offset);
        dbl_s2 = vis_faligndata(u_s2_0, u_s2_1);
        u_s2_0 = u_s2_1;
        u_s2_1 = s2_aligned[i + 2];

        /* Blend the upper four pixels. */
        dbl_s1_e = vis_fexpand(vis_read_hi(dbl_s1));
        dbl_s2_e = vis_fexpand(vis_read_hi(dbl_s2));
        dbl_tmp2 = vis_fpsub16(dbl_s2_e, dbl_s1_e);
        dbl_tmp1 = vis_fmul8x16(vis_read_hi(quad_a), dbl_tmp2);
        dbl_sum1 = vis_fpadd16(dbl_s1_e, dbl_tmp1);

        /* Blend the lower four pixels. */
        dbl_s1_e = vis_fexpand(vis_read_lo(dbl_s1));
        dbl_s2_e = vis_fexpand(vis_read_lo(dbl_s2));
        dbl_tmp2 = vis_fpsub16(dbl_s2_e, dbl_s1_e);
        dbl_tmp1 = vis_fmul8x16(vis_read_lo(quad_a), dbl_tmp2);
        dbl_sum2 = vis_fpadd16(dbl_s1_e, dbl_tmp1);

        /* Pack, join halves, and store under the edge mask. */
        dbl_d = vis_freg_pair(vis_fpack16(dbl_sum1),
                              vis_fpack16(dbl_sum2));
        vis_pst_8(dbl_d, (void *) d_aligned, emask);
        ++d_aligned;
        emask = vis_edge8(d_aligned, d_end);
    }
}

5.3.5 Convert a BGR image to an ARGB image


This example shows an application that uses VIS 2.0 instructions (bmask and
bshuffle) to convert a 3-band BGR image to a 4-band ARGB image.


Note that the source and destination images must be the same size. The data type
of both images is unsigned char (byte). The pixels in the source image are
pixel-interleaved in the order BGRBGR... The pixels in the destination image
are pixel-interleaved in the order ARGBARGB... It is assumed that the size of
the image is a multiple of 8.
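For comparison, the same conversion can be expressed as a scalar loop that moves one pixel at a time. This sketch is a hypothetical reference (bgr_to_argb_ref is not part of the VIS library), and it takes the alpha value as a parameter rather than assuming the constant used by the VIS listing.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar reference: convert npixels pixel-interleaved BGR pixels into
 * pixel-interleaved ARGB pixels, filling every alpha byte with `alpha`.
 */
void bgr_to_argb_ref(const uint8_t *src_bgr, uint8_t *dst_argb,
                     size_t npixels, uint8_t alpha)
{
    for (size_t i = 0; i < npixels; ++i) {
        dst_argb[4 * i + 0] = alpha;               /* A */
        dst_argb[4 * i + 1] = src_bgr[3 * i + 2];  /* R */
        dst_argb[4 * i + 2] = src_bgr[3 * i + 1];  /* G */
        dst_argb[4 * i + 3] = src_bgr[3 * i + 0];  /* B */
    }
}
```

Eight source pixels occupy 24 bytes and eight destination pixels occupy 32 bytes, which is why the VIS loop reads three doublewords and writes four per iteration.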


#include <stdio.h>
#include <stdlib.h>
#include "vis_types.h"
#include "vis_proto.h"

vis_s32 BGR2ARGB(vis_u8 *srcBGR, vis_u8 *dstARGB, int size)
/*
 * ARGUMENT
 * srcBGR   pointer to source image data
 * dstARGB  pointer to destination image data
 * size     number of pixels in each image
 */
{
    vis_d64 *sp;           /* 8-byte aligned pointer in source */
    vis_d64 *dp;           /* 8-byte aligned pointer in destination */
    vis_d64 sd, sd1, sd2;  /* 8-byte data */
    vis_d64 dd;            /* 8-byte data */
    vis_d64 alpha;
    int i;

    sp = (vis_d64 *) srcBGR;
    dp = (vis_d64 *) dstARGB;

    alpha = vis_to_double_dup(0x8080);

    /* prepare GSR.mask for bshuffle */
    vis_write_bmask(0xE210F543, 0);

    for (i = 0; i < size/8; i++) {
        sd = *sp;                        /* BGRBGR */
        sp++;
        dd = vis_bshuffle(sd, alpha);
        *dp = dd;                        /* ARGBARGB */
        dp++;

        vis_alignaddr((void *) 0, 6);
        sd1 = *sp;
        sd = vis_faligndata(sd, sd1);    /* BGRBGR */
        sp++;
        dd = vis_bshuffle(sd, alpha);
        *dp = dd;                        /* ARGBARGB */
        dp++;

        vis_alignaddr((void *) 0, 4);
        sd2 = *sp;
        sd = vis_faligndata(sd1, sd2);   /* BGRBGR */
        sp++;
        dd = vis_bshuffle(sd, alpha);
        *dp = dd;                        /* ARGBARGB */
        dp++;

        vis_alignaddr((void *) 0, 2);
        sd = vis_faligndata(sd2, sd2);   /* BGRBGR */
        dd = vis_bshuffle(sd, alpha);
        *dp = dd;                        /* ARGBARGB */
        dp++;
    }

    return(0);
}



5.4 Graphics Applications: Texture Mapping 


This section of code computes the depth Z and color (α, B, G, R) of each pixel in
a triangle object. Z is a 32-bit Z-buffer value and α, B, G, R are 8-bit alpha,
blue, green and red values. The 32-bit Z value is concatenated with the 32-bit
(α, B, G, R) value and the resulting 64-bit value is sent to the frame buffer.
Computing (α, B, G, R) consists of a lookup from a texture map, followed by the
application of diffuse and specular lighting, which is a multiply and add
operation. Using VIS we can stuff (α, B, G, R) into a 32-bit floating-point
register and use the VIS partitioned arithmetic operators vis_fmul8x16() and
vis_fpadd16() to operate on α, B, G, and R at the same time. In the code example
shown, we are not interested in the α value; hence, it is masked out. The
following is a small section of code that is part of a bigger function and is
not a complete function by itself:


float fcolor;
unsigned mask = 0xffffff;
float fmask = *(float*)&mask;
double dpxl1, dpxl2, dpyl1, dpyl2, ddyl1, ddyl2, ddxl1, ddxl2;
int idxu, idxv, ipxu, ipxv;
long long value;
/* loop through every span line of the triangle */
while (--ily >= 0) { 


/* Check to see if middle edge expired. */ 


if (--imy == 0) 
if (xdir > 0) { 
ipmx = iplx; idmx = idlx; 
} else { 


iphx = iplx; idhx = idlx; 
fpyz = fpmz; fdyz = fdmz; 
fpyu = fpmu; fdyu = fdmu; 
fpyv = fpmv; fdyv = fdmv; 
dpyll = 00011; ddyll = 2 
dpyl2 = dpml2; ddyl2 = 2 


/* Compute end of span and adjust to first pixel. */
i = (iphx + FIXMSK) >> FIXSHF;
j = -iphx & FIXMSK;
fbx = fby + (i*8);





/* number of pixels in the span */ 
xcnt = ((ipmx + FIXMSK) >> FIXSHF) - i; 


if(xcnt > 0) { 
a = (float) j; 
2אס‎ = (int) (fpyz + (float) (idxz << 116( *8( ; 


ipxu = (int) (fpyu + fdxu*a); 





ipxv = (int) (fpyv + fdxv*a); 
dpxll = 2 
dpxl2 = 7 


/* loop through every pixel */
while (--xcnt >= 0) {
    /* texture color lookup */
    fcolor = *(float*)&(tm[((ipxv >> v_shift)
                           << logw) + (ipxu >> u_shift)]);

    /* apply diffuse and specular lighting */
    /* final color = ((texel & mask) * diffuse) + specular */
    /* fcolor = ((fcolor & fmask) * dpxl1) + dpxl2 */
    fcolor = vis_fpack16(vis_fpadd16(
                 vis_fmul8x16(vis_fands(fcolor, fmask),
                              dpxl1), dpxl2));

    /* send it to frame buffer */
    value = ((long long) (ipxz >> Z_SHIFT)
             << 32) | *(unsigned*)&fcolor;

    /* FGR_FFB_WRITE64_RAW(fbx, value); */

    /* increment delta */
    ipxu += idxu;
    ipxv += idxv;
    dpxl1 = vis_fpadd16(dpxl1, ddxl1);
    dpxl2 = vis_fpadd16(dpxl2, ddxl2);
    fbx += 8;
    ipxz += idxz;
}


/* increment delta */ 


= idhx; 
= idmx; 
= fdyz; 
= fdyu; 
- fdyv; 
= vis, fpaddl6(dpyl1, ddyll); 


fuse lighting coefficient*/ 


= vis fpaddl6(dpyl2, ddyl2); 





iphx 
ipmx 
fpyz 
fpyu 
fpyv 
dpyll 


/*dif 


dpyl2 


/*specular lighting coefficient*/ 


fby += dlb; 
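
The per-pixel arithmetic in this example can also be modeled in portable C, one channel at a time. The sketch below is only an illustration of the multiply-and-add and of the 64-bit concatenation; it does not reproduce the exact rounding or scaling of the VIS instructions. The function names, the 8.8 fixed-point format of the diffuse factors, and the channel layout are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of one 8-bit channel. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/*
 * Scalar model of: color = ((texel & mask) * diffuse) + specular.
 * texel    packed 32-bit pixel, alpha in bits 24..31 (assumed layout)
 * diffuse  per-channel 8.8 fixed-point factors, 256 == 1.0 (assumed format)
 * specular per-channel additive terms
 */
static uint32_t shade_texel(uint32_t texel, const uint16_t diffuse[3],
                            const int16_t specular[3])
{
    uint32_t out = 0;
    int c;

    assert(diffuse && specular);
    texel &= 0xffffffu;                 /* mask out alpha, as in the example */
    for (c = 0; c < 3; c++) {
        int32_t ch  = (int32_t)((texel >> (8 * c)) & 0xffu);
        int32_t lit = ((ch * (int32_t)diffuse[c]) >> 8) + specular[c];
        out |= (uint32_t)clamp_u8(lit) << (8 * c);
    }
    return out;
}

/* Concatenate a 32-bit Z value (upper half) with a 32-bit color (lower half). */
static uint64_t pack_z_color(uint32_t z, uint32_t color)
{
    return ((uint64_t)z << 32) | color;
}
```

The VIS version performs the three channel operations in a single partitioned instruction sequence, which is where the speedup comes from.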
5.5 Audio Applications: Finite Impulse Response (FIR) Filter


This example shows the implementation of an FIR filter of length flen operating on
an input data string in accordance with the following relationship:

$$dst[n] = \sum_{k=0}^{flen-1} fir[k] \times src[n-k], \qquad 0 \le n < dlen$$

A 16-bit × 16-bit multiplication is performed and the result is accumulated as a
32-bit value.
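
As a point of comparison, the relationship above can be sketched as a scalar reference in portable C. This is not the VIS implementation. The function name fir16_ref is hypothetical, and the sketch assumes that samples before src[0] count as zero, that the 32-bit sum does not overflow, and that the output keeps bits 16 to 31 of the accumulator.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar reference: dst[n] = sum over k of fir[k] * src[n - k], 0 <= n < dlen.
 * Samples before src[0] are treated as zero (an assumption of this sketch).
 */
static void fir16_ref(const int16_t *src, int16_t *dst, int dlen,
                      const int16_t *fir, int flen)
{
    int n, k;

    assert(src && dst && fir && dlen >= 0 && flen >= 0);
    for (n = 0; n < dlen; n++) {
        int32_t acc = 0;                        /* 32-bit accumulator */
        for (k = 0; k < flen && k <= n; k++)
            acc += (int32_t)fir[k] * (int32_t)src[n - k];
        /* keep the upper 16 bits (bits 16..31) of the 32-bit result */
        dst[n] = (int16_t)(uint16_t)((uint32_t)acc >> 16);
    }
}
```

A reference of this kind is mainly useful for checking the output of an optimized version on small inputs.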


#include <stdlib.h>
#include "vis_types.h"
#include "vis_proto.h"

void vis_fir_16(vis_s16 *src, vis_s16 *dst, int dlen,
                vis_s16 *fir, int flen)
/*
 * src   pointer to first sample of source data
 * dst   pointer to first sample of destination data
 * dlen  length of destination data
 * fir   coefficients of FIR filter
 * flen  length of FIR filter
 */
{
    vis_u8  *sa, *ss;       /* start point in source data */
    vis_d64 *sp;            /* 8-byte aligned start point in source */
    vis_u8  *da;            /* line start point in destination */
    vis_u8  *dend;          /* line end point in destination */
    vis_d64 *dp;            /* 8-byte aligned start point in dest. */
    int off;                /* offset of address alignment in dest. */
    int emask;              /* edge masks */
    vis_d64 sd, s0, s1;     /* source data */
    vis_f32 sh, sl;
    vis_f32 ff;             /* filter data */
    vis_u32 fu;
    vis_d64 thh, thl, tlh;  /* temporaries */
    vis_d64 tll, tdh, tdl;
    vis_d64 rdh, rdl;       /* intermediate results */
    vis_d64 dd;             /* destination data */
    vis_f32 dh, dl;
    int n, k, num;          /* loop variables */

    /* set GSR scale factor to 0, such that bits 16 to 31 of */
    /* each vis_s32 component will be saved by vis_fpackfix() */
    vis_write_gsr(0);

    /* prepare the destination address */
    da = (vis_u8 *) dst;
    dp = (vis_d64 *) ((vis_addr) da & (~7));
    off = (vis_addr) dp - (vis_addr) da;
    dend = da + 2 * dlen - 1;

    /* generate edge mask for the start point */
    emask = vis_edge16(da, dend);

    /* prepare the source address */
    sa = (vis_u8 *) src;
    num = ((vis_addr) dend >> 3) - ((vis_addr) da >> 3) + 1;
    for (n = 0; n < num; n ++) {
        ss = sa;
        rdh = vis_fzero();
        rdl = vis_fzero();
        for (k = 0; k < flen; k ++) {
            /* load 8 bytes of source data */
            sp = (vis_d64 *) vis_alignaddr(ss, off);
            s0 = sp[0];
            s1 = sp[1];
            sd = vis_faligndata(s0, s1);

            /* replicate the coefficient into both 16-bit halves */
            fu = (fir[k] << 16) | (fir[k] & 0xffff);
            ff = vis_to_float(fu);

            sh = vis_read_hi(sd);
            sl = vis_read_lo(sd);
            thh = vis_fmuld8sux16(sh, ff);
            tlh = vis_fmuld8sux16(sl, ff);
            thl = vis_fmuld8ulx16(sh, ff);
            tll = vis_fmuld8ulx16(sl, ff);
            tdh = vis_fpadd32(thh, thl);
            tdl = vis_fpadd32(tlh, tll);
            rdh = vis_fpadd32(rdh, tdh);
            rdl = vis_fpadd32(rdl, tdl);
            ss += 2;
        }
        dh = vis_fpackfix(rdh);
        dl = vis_fpackfix(rdl);
        dd = vis_freg_pair(dh, dl);

        /* store 8 bytes of result */
        vis_pst_16(dd, dp, emask);
        sa += 8;            /* advance the source by four 16-bit samples */
        dp ++;

        /* prepare edge mask for the end point */
        emask = vis_edge16(dp, dend);
    }
}
5.6 Video Applications: Motion Vector Estimation

This example shows a single iteration of a motion vector estimation process. A
16x16 block of pixels of frame2 is taken and a search within a specified area in
frame1 is performed to determine whether something “similar” to the 16x16 block
from frame2 exists. If it does, then a motion vector is estimated from this location.
Similarity is measured by the sum of absolute differences, diff, between the two
16x16 blocks. The sum of absolute differences is computed in accordance with the
following relationship:

$$diff = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| frame1(i,j) - frame2(i,j) \right|$$

The speedup capability of VIS is illustrated by the loading and processing of
eight bytes at a time. vis_pdist() computes the sum of absolute differences over
eight pixels at a time. Data of less than eight bytes are processed by plain
unpartitioned C.
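
For reference, the relationship above can be written as a plain scalar loop in portable C. The sketch below handles a full 16x16 block only and ignores the search-area clipping; the function name sad16x16_ref and its parameters are hypothetical and are not part of the VIS kit.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Scalar sum of absolute differences between two 16x16 blocks.
 * b1, b2    pointers to the upper left pixel of each block
 * lb1, lb2  number of bytes in one row of each frame
 */
static unsigned long sad16x16_ref(const unsigned char *b1, int lb1,
                                  const unsigned char *b2, int lb2)
{
    unsigned long diff = 0;
    int i, j;

    assert(b1 && b2);
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            diff += (unsigned long)abs((int)b1[i * lb1 + j] -
                                       (int)b2[i * lb2 + j]);
    return diff;
}
```

Each call of the inner statement handles one pixel pair, whereas vis_pdist() handles eight pairs per instruction.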


#include <stdlib.h>
#include "vis_types.h"
#include "vis_proto.h"

#define max(a,b) ((a)>(b)?(a):(b))
#define min(a,b) ((a)<(b)?(a):(b))

unsigned long long
vis_sumabsdiff(vis_u8 *frame1, int f1lb,
               vis_u8 *frame2, int f2lb, int f1x, int f1y, int f2x,
               int f2y, int sx, int sy, int sh, int sw)
/*
 * frame1    pointer to byte data of frame 1
 * f1lb      number of bytes in one row of frame 1 (width)
 * frame2    pointer to byte data of frame 2
 * f2lb      number of bytes in one row of frame 2 (width)
 * f1x, f1y  upper left corner of 16x16 block in frame 1
 * f2x, f2y  upper left corner of 16x16 block in frame 2
 * sx, sy    upper left corner of search area in frame 1
 * sh, sw    height and width of search area in frame 1
 */
{
    vis_u8 *sa1 = frame1 + f1lb*f1y + f1x;  /* start point in frame1 */
    vis_u8 *sa2 = frame2 + f2lb*f2y + f2x;  /* start point in frame2 */
    vis_u8 *sl1, *sl2;
    vis_d64 *sp1;           /* 8-byte aligned start point in frame1 */
    vis_d64 *sp2;           /* 8-byte aligned start point in frame2 */
    vis_d64 sd1, s11, s10;  /* source data */
    vis_d64 sd2, s21, s20;
    vis_d64 accum;          /* accumulated sum of differences */
    unsigned long long rest = 0;  /* sum of differences from leftover bytes */
    union { vis_d64 d64;
            unsigned long long ull; } result;
    int i, j;
    int x, y, nx, ny, nx8;

    /* find intersection of search area and 16x16 block
       starting at (f1x,f1y) */
    x = max(sx, f1x);
    nx = min(sx+sw, f1x+16) - x;    /* new width in bytes */
    y = max(sy, f1y);
    ny = min(sy+sh, f1y+16) - y;    /* new height in bytes */
    if (nx <= 0 || ny <= 0) return 0;
                            /* 16x16 block is outside search area */

    /* compute width in 8-byte units */
    nx8 = nx>>3;
    accum = vis_fzero();
    sl1 = sa1; sl2 = sa2;

    /* row loop */
    for (j = 0; j < ny; j++) {
        for (i = 0; i < nx8; i++) {
            /* load 8 bytes of source data from frame1 */
            sp1 = (vis_d64 *) vis_alignaddr(sa1, 0);
            s10 = sp1[0];
            s11 = sp1[1];
            sd1 = vis_faligndata(s10, s11);

            /* load 8 bytes of source data from frame2 */
            sp2 = (vis_d64 *) vis_alignaddr(sa2, 0);
            s20 = sp2[0];
            s21 = sp2[1];
            sd2 = vis_faligndata(s20, s21);

            accum = vis_pdist(sd1, sd2, accum);
            sa1 += 8;
            sa2 += 8;
        }
        sl1 = sa1 = sl1 + f1lb;
        sl2 = sa2 = sl2 + f2lb;
    }

    /* process what's left over (nx%8) in plain C code */
    sa1 = sl1 = frame1 + f1lb*f1y + f1x + nx8*8;
    sa2 = sl2 = frame2 + f2lb*f2y + f2x + nx8*8;
    nx -= (nx8*8);
    if (nx) {
        for (j = 0; j < ny; j++) {
            for (i = 0; i < nx; i++) {
                /* accum holds an integer bit pattern in a floating-point
                   register, so the leftovers are summed in an integer */
                rest += abs(*sa1 - *sa2);
                sa1++; sa2++;
            }
            sl1 = sa1 = sl1 + f1lb;
            sl2 = sa2 = sl2 + f2lb;
        }
    }
    result.d64 = accum;
    return result.ull + rest;
}
6.1 Chapter Overview


This appendix provides some helpful hints and suggestions to consider when 
writing code for the UltraSPARC. 


6.2 Using Compiler Optimization 


Consider the following options during compiling and linking for additional opti- 
mization: 


-fast
-xchip=[ultra|ultra2]
-xdepend
-xrestrict=[%all|f1,f2,...]

Please see the cc(1) man page for applicability of these options.

Note: Since -fast is a combination of options, if you use -fast with other options,
it should come first. In this way, options specified later can override the
options in -fast.


6.3 Using Preprocessing Directives

Consider the following pragmas for loops in your code:

#pragma pipeloop(n)
#pragma nomemorydep

See "Preprocessing Directives" in the C User's Guide (Part No: 805-4952) for the
applicability of these pragmas. It is available from the following URL:

http://docs.sun.com:80/ab2/coll.33.5/CUG/GAb2PageView/9237
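
As a rough illustration of where these pragmas are placed, the sketch below puts both directly in front of a simple for loop. The function add_arrays is hypothetical, the argument 0 and the comments reflect one reading of the pragmas that should be checked against the C User's Guide, and the __SUNPRO_C guard is added here so that other compilers skip the directives.

```c
#include <assert.h>

/* dst[i] = a[i] + b[i]; the caller guarantees that the arrays do not overlap. */
static void add_arrays(float *dst, const float *a, const float *b, int n)
{
    int i;

    assert(dst && a && b && n >= 0);
#if defined(__SUNPRO_C)
#pragma pipeloop(0)     /* assumed meaning: no loop-carried dependence */
#pragma nomemorydep     /* assumed meaning: no memory dependence within an iteration */
#endif
    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The assertions made by the pragmas are promises from the programmer, so they should be used only on loops where the stated independence actually holds.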








6.4 Minimization of Conditional Usage 


To take full advantage of the superscalar pipeline architecture, always use the
most predictable instruction patterns and avoid conditionals inside tight loops.
If you are tempted to use branches to reduce memory references or computations,
consider that in many cases this might actually impede the generation of
efficient code. Branching inhibits the efficient grouping of instructions, which
results in inefficient use of the pipelined architecture of the UltraSPARC.
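
One common way to remove a conditional from a tight loop is to compute the result arithmetically from the comparison. The sketch below thresholds an array of 8-bit pixels without an if statement in the loop body; the function name threshold_u8 is hypothetical, and whether the compiler emits fully branch-free code depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Branching form, shown for comparison:
 *     if (src[i] > t) dst[i] = 255; else dst[i] = 0;
 */
static void threshold_u8(unsigned char *dst, const unsigned char *src,
                         size_t n, unsigned char t)
{
    size_t i;

    assert(dst && src);
    for (i = 0; i < n; i++) {
        /* (src[i] > t) is 1 or 0; negating it gives -1 or 0, which
           becomes 255 or 0 when converted to unsigned char */
        dst[i] = (unsigned char)-(src[i] > t);
    }
}
```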


6.5 Dealing With Misaligned Data 


VIS typically deals with groups of four or eight data values at a time, but the
size of your data may not be an exact multiple of four or eight. When dealing
with 2D image scan lines, you can use the vis_faligndata() and
vis_edge[8,16,32]() instructions. There may be cases, however, where handling the
leftover data would require complex logic in combination with VIS instructions.
In such cases, it is typically best to write small “clean-up” loops for clarity
rather than for speed: on average only a vanishing percentage of the run time is
spent there, so you might prefer not to spend a significant portion of code
development and debugging time on them. In addition, clever loop optimizations
often slow down loops that are executed only a few times.
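
The pattern of a wide main loop followed by a small clean-up loop can be sketched in portable C as follows. The example inverts an array of bytes, eight at a time through a 64-bit word, and then finishes the remaining bytes one by one. It stands in for a VIS loop only by analogy; the function name invert_u8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invert n bytes: eight at a time in the main loop, then a simple tail loop. */
static void invert_u8(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    assert(dst && src);
    /* main loop: one 64-bit word (eight pixels) per iteration */
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);     /* memcpy tolerates any alignment */
        w = ~w;
        memcpy(dst + i, &w, 8);
    }
    /* clean-up loop: at most seven leftover bytes, written for clarity */
    for (; i < n; i++)
        dst[i] = (unsigned char)~src[i];
}
```

The clean-up loop runs at most seven times per call, so keeping it simple costs very little.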


6.6 Cycle Expensive Operations 


Reading and writing the GSR are cycle-expensive operations, so use them
sparingly. vis_alignaddr() is another cycle-expensive operation because it does
not get grouped with any other instruction; you should typically use it outside
a loop. When joining two vis_f32 variables into a single vis_d64 variable,
vis_freg_pair() is preferable to vis_write_hi() and vis_write_lo(), because the
compiler attempts to minimize the number of floating-point move operations
through strategic use of register pairs.
6.7 Advantage of Using Pre-aligned Data

Use of vis_alignaddr() and vis_faligndata() is required to access non-aligned
data because most of the VIS instructions require 8-byte aligned data. However,
vis_alignaddr() is a cycle-expensive operation, because it does not get grouped
with any other instruction. In some cases, dealing with data alignment takes 30%
of the running time.


One way to avoid the penalty for vis_alignaddr() and vis_faligndata() is to use 
pre-aligned data: that is, using data that start at 8-byte aligned addresses (64-byte 
aligned addresses for code using block load/store instructions). A 64-byte 
aligned data block can be allocated with the following C code: 


vis_u8 *buf;
vis_u8 *img;    /* 64-byte aligned address */

buf = (vis_u8 *) malloc(imagesize + 64);
img = (vis_u8 *) ((vis_addr) buf & (~0x3f)) + 64;
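
The same idea can be expressed in portable C with uintptr_t, which holds a pointer value in both 32-bit and 64-bit builds. The helper below is a sketch; the name alloc_aligned64 and the convention of returning the original pointer through base (so that it can be passed to free() later) are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Return a 64-byte aligned pointer inside a freshly allocated block.
 * *base receives the pointer that must later be passed to free().
 * Returns NULL if the allocation fails.
 */
static unsigned char *alloc_aligned64(size_t size, void **base)
{
    unsigned char *raw;

    assert(base);
    raw = malloc(size + 64);
    *base = raw;
    if (raw == NULL)
        return NULL;
    /* round down to a 64-byte boundary, then step to the next one */
    return (unsigned char *)(((uintptr_t)raw & ~(uintptr_t)0x3f) + 64);
}
```

Because the aligned pointer is never more than 64 bytes past the start of the block, the extra 64 bytes requested from malloc() are enough to hold size bytes from the aligned address.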


In addition to using pre-aligned data, if the image size is a multiple of eight
(64 for code using block load and store), then the vis_edge8() instructions can
be removed to provide additional speedup. An example VIS implementation of image
inversion, covering both the general data format and 8-byte pre-aligned data
whose image size is a multiple of eight, is provided in:

$VSDKHOME/examples/src/vis_inverse8.c